Optimisation of Maintenance Policies Based on Right-Censored Failure Data Using a Semi-Markovian Approach

This paper exposes the existing problems for optimal industrial preventive maintenance intervals when decisions are made with right-censored data obtained from a network of sensors or other sources. A methodology based on the use of the z transform and a semi-Markovian approach is presented to solve these problems and obtain a much more consistent mathematical solution. This methodology is applied to a real case study of the maintenance of large marine engines of vessels dedicated to coastal surveillance in Spain to illustrate its usefulness. It is shown that the use of right-censored failure data significantly decreases the value of the optimal preventive interval calculated by the model. In addition, that optimal preventive interval increases as we consider older failure data. In sum, applying the proposed methodology, the maintenance manager can modify the preventive maintenance interval, obtaining a noticeable economic improvement. The results obtained are relevant, regardless of the number of data considered, provided that data are available with a duration of at least 75% of the value of the preventive interval.


Introduction
The main goal of this work is to present a methodology that allows finding the optimal maintenance preventive interval when the failure data are right-censored. In particular, we are interested in finding out how the use of right-censored data affects the calculation of the optimal maintenance interval, considering the temporary maintenance schedule of maintenance interventions and the different types of costs incurred. This study is relevant, since these are the types of data that are generally available in the vast majority of companies, which the maintenance manager must use.
When the maintenance engineer is faced with the problem of managing equipment subject to predetermined preventive maintenance, one of the tasks is to determine the preventive maintenance interval that economically optimises (other factors could also be optimised) the series of interventions that can be carried out on a physical asset over a specified period [1]. For this predetermined maintenance, the interventions can be of two types: corrective interventions originated by a failure of some component of the equipment (failure mode) and preventive interventions carried out after a certain number of hours, kilometres, or units produced. The former is produced randomly, and their number and However, to the best of our knowledge, very few studies provide rules and guidelines to modify the default maintenance interval optimally, and this paper helps to fill that gap. When the maintenance manager has censored data due to the implementation of preventive tasks, the proposed methodology represents an advance with respect to previous works such as [21][22][23][24]42]. This article differs from previous studies in providing the maintenance manager with a prediction of the length of the optimal preventive interval as well as a decision rule to improve the interval being used, based on the analysis of the results of various scenarios, and by using the methodology able to deal with right-censored data. This rule guides the manager in the direction of modification of the preventive interval to be used (increasing or decreasing it) and the intensity of the modification. In this way, the uncertainty induced by using censored data can be overcome. Therefore, to fill the identified gaps, two research questions are formulated: RQ1: How far is the obtained result from the optimal solution when the failure data are censored?
RQ2: What strategy would help to move closer to the optimum based on censored data?
The developed methodology guarantees that the solution found is optimal from a mathematical point of view, incorporating the time right-censored data, together with different types of costs and income. In addition, it applies a Semi-Markovian approach of transition between states [43][44][45], plus the z-transform, to find the optimal maintenance interval, which, as far as we know, is the first time it has been applied with right-censored data. It is also relevant that the developed methodology enables the assessment and comparison of alternative maintenance policies.
The rest of the paper is organised as follows. Section 2 provides the real data for this study, the research questions formulated, and the methodology applied to develop the mathematical optimisation model based on semi-Markov processes and z-transform. Section 3 presents the results for the different cases studied. In Section 4, those results are highlighted, discussed, and compared. Finally, Section 5 provides the conclusions drawn from the study.

Materials and Methods
To adapt the preventive interval to the particular equipment conditions usage, the manager responsible for maintenance should use the information generated by the equipment itself. The failure information generated is right-censored and it is a challenge to determine the behaviour before the failure of the equipment. The method followed to find the optimal interval when the available failure data is right-censored is explained throughout the article. This process is graphically summarised in Figure 1.

Real Case-Data Selection and Information Processing
This work analyses the behaviour of the O-rings located in the cooled crossover of a 12V diesel engine with 2 litres per cylinder. The crossover is a jacketed tube through which exhaust gases flow. The engine coolant cools this tube, and two O-rings are installed at each end of the tube to prevent coolant leakage. High temperatures affect the O-rings and are responsible for their degradation. In this experiment, failure data have been collected for several years, following the preventive replacement policy of O-rings every 4000 h (scenario A). During this time, it was observed that none of the O-rings had reached this limit, remaining at values very far from this value. This fact allows us to consider that the values of the observed failures can be considered as "not" censored. These failures are listed in Table 1. In addition, in some cases, the O-rings were replaced preventively because the last failure occurred close to the time of the preventive replacement. These are the censored values that are collected in Table 2. Obtaining the optimal preventive interval value t 0 for two group of cases.    190  276  296  409  429  430  437  454  481  492  498  499  543  552  552  552  552  577  603  604  604  612  619  658  675  683  696  702  742  754  773  797  812  836  881  889  912  913  942  974  994  994  1014 1015 1024 1025 1041 1105  1183 1203 1211 1236 1238 1240 1249 1274 1295 1304 1312 1343  1345 1407 1413 1421 1442 1447 1486 1492 1542 1581 1601 1621  1630 1675 1735 1892 1926 1960 2006 2101 2242 2437 2450  Subsequently, the value of the preventive interval was modified, settling at 1000 h. In this case, the failure data were considerably reduced as many observations were rightcensored by the predetermined preventive maintenance performed at 1000 h (scenario B). This censorship radically changed the way of analysing the problem, since the results will depend less on the type of technique or method used and more on the type of censorship [46] and the extent of the censorship. Scenario B, corresponding to the preventive interval of 1000 h, is the one that is usually presented to the maintenance manager.

Determination of the Failure Distribution Function
Data from maintenance interventions are collected at the machine or through an automated procedure using a network of sensors. Its mathematical use for optimisation requires treating these data until obtaining information on the appropriate protocol that can be understood by the model proposed for the simulation. This treatment often supposes a loss of veracity. In other cases, it may happen that the information on which the treatment is started is not adequate. Achieving the most reliable information to the original exposed in the desired protocol is one of the objectives to be achieved in any optimisation process.
In our case, the original starting data could be classified into two groups, those generated during the maintenance activities and those generated during the economic management of those maintenance activities. The first group included failure data and preventive activity data. These data are the hours of operation of the equipment in which preventive and corrective interventions are carried out and the duration of these interventions. In the first group, they were counted from the last repair or preventive change until the failure appeared or until a preventive task was carried out, and the equipment was inactive due to the intervention. The second group of data included the costs of each of the corrective and preventive interventions that were carried out and the income obtained from the use of the equipment.
For the first data group, it was necessary to establish the method to determine the method of mathematically obtaining the failure distribution. Many examples can be found in the literature [47,48]. In our case, a Weibull probability distribution function was used, although other authors analyse different methods to estimate the parameters of the Weibull distribution [49,50]. Other authors have previously used other distributions. However, the Weibull distribution is the one that best adapts to the process of failure appearance in industrial assets. Furthermore, the Weibull distribution includes the exponential distribution and has the advantage of using two or three parameters instead of a single parameter of the exponential function. The exponential survival function has the hazard rate constant, while the Weibull survival distribution extends the exponential distribution to allow constant, increasing, or decreasing hazard rates.
The estimation of the model parameters becomes a difficult task when censorship radically changes the available information and when the opinion of experts does not help to understand the system's behaviour [51].
Many authors have estimated the parameters of the Weibull distribution for censored data through the Maximum Likelihood Estimation method (MLE), using algorithms for their numerical resolution such as the Maximised Expectation (EM) algorithm, see [52][53][54]. Other authors use and compare various methods such as the Synthetic Minority Oversampling TEchnique (SMOTE) [55], MLE, Least Square Estimation (LSE) [56], Weibull [57] probability plot, regression of the range of medians with the Benard approximation, and even combine methods, see [58,59]. Bayesian estimators [60] and other types of linear estimators [61] have also been used.
At the same time, the repair and preventive replacement average times and the cost and income average of each type of intervention (corrective or preventive) are also calculated. This last type of data generally poses more difficulties to obtain than to calculate it. However, despite not being as obvious as the calculation of the previous means, the calculation of the failure distribution function can lead to significant management errors, mainly due to the lack of coherence of the starting data. When these failure data are right-censored, due to the performance of preventive maintenance, poor optimisation of the preventive interval will be achieved. This article focuses on the influence of the maintenance management data (first group) to determine the distribution function to perform the preventive interval optimisation calculations.
In this research, a reduced number of data will be used, with which we will obtain an exact solution for the optimal preventive interval. If we had used a large number of data, as is the case with information collected online through sensors, the result would have been the same (exact solution). Nevertheless, in both cases, as will be shown, the results will be far from the optimal solution.
The equipment was failing and was repaired, with few cases of preventive interventions (when the scheduled time according to predetermined maintenance was reached), which also lasted for a few hours of operation. This case represents a situation where the censorship was practically nonexistent, and the observed distribution function was very close to the real one. On the other hand, Tables 3 and 4 represent the failure and preventive data when a predetermined preventive threshold was established at 1000 h of operation. In this second case, the behaviour of the O-rings would not be known beyond 1000 h. In both cases, it does make sense to estimate the behaviour beyond the predetermined threshold or to analyse the differences between the optimum intervals in both cases. Therefore, a procedure to obtain estimations from the second case closer to those obtained for the first one can be distilled. Looking to gather more information, the following procedure was adopted: • From the failure data and preventive maintenance data collected in Tables 3 and 4, the observed function was calculated applying the range of medians method and using the Benard approximation. • From the observed function, the theoretical Weibull function that fit the best was determined. Then, the least-squares method was used to obtain the theoretical function's two or three parameters. • Subsequently, the theoretical function was used to optimise the preventive interval economically.
In Figure 2, the procedures for the data in Tables 3 and 4 are summarised. The Weibull failure distribution function that best fit the observed function had two parameters, the shape parameter with value 1.88 and a scale parameter (characteristic life) with value 3603. These data were very different from the data obtained for the case of uncensored data and would give rise to a very different preventive interval. The detailed process for the data in Tables 1 and 2 can be verified in Sánchez-Herguedas et al. [42]. The Weibull functions adjusted to the uncensored data were, in that case, W(2.36, 1317, 0) and W (1.95, 1202, 117), where the first value corresponded to the shape parameter, the second to the scale parameter, and the third to the guaranteed life parameter.

Semi-Markov Maintenance Model with Returns for a Finite Period
To study the behaviour of O-rings concerning failure, we need a mathematical model that reflects this behaviour, such as the semi-Markovian model of three states, which is used with the data that appear in Tables 1 and 2. This model applies to the predetermined preventive maintenance of much industrial equipment, since it contains the two most representative maintenance states, the state of the equipment when corrective maintenance is performed after a failure (State S2) and the state of the equipment when preventive maintenance is performed after a predetermined number of operating hours (State S3). In these two cases analysed, the data correspond to preventive intervals of 4000 and 1000 h of operation, respectively. State S1 corresponds to the period in which the equipment is performing its required function. In this model, the equipment and others with the same characteristics evolve over time, changing the state according to the law of probability. When the system is in the operational state, the probability of equipment failure is distributed as a Weibull distribution. If the equipment fails, the system goes to the corrective state (S2) and develops a corrective intervention. Suppose the equipment does not fail after a time τ, the system transitions to the preventive state, developing a preventive intervention. Once these maintenance interventions (corrective and preventive) have been carried out, the system returns to its operational state. Both types of interventions have associated costs.
On the other hand, during the operational state, the equipment generates income. Over time, these costs and income are accumulated in a variable called the average accumulated return, V(m). The optimisation of this variable will allow the calculation of the optimal preventive interval, τ 0 . These transitions between states and the accumulation of returns are graphically expressed in Figure 3.

Formulating a Difference Equation System for Average Accumulated Return
The model that reflects the behaviour of the system is a semi-Markovian model. The variable to optimise is the average accumulated return in the successive transitions between states. In m transitions, starting from state i, the system accumulates m returns, which added to their respective signs (+ for income and − for costs) will constitute the accumulated return Q i (m) in m transitions from state i. For each value of i and m, the accumulated return is a random variable because once the initial state is determined, the next state is unpredictable, as well as the following states. Thus, in m transitions, the system can evolve in many ways. For this reason, it is interesting to know the average accumulated return instead of the accumulated return itself.
Let v i (m) = E(Q i (m)) be the average accumulated return in m transitions when the system starts from the initial state i. To calculate v i (m), a difference equation is constructed, separating the m transitions into two stages. The first stage is the one constituted by the transition from the initial state i to the next state j. As the second state can be any of the states of the system, the return v i (1) in a single transition is a random variable that can reach the values r i1 (1), r i2 (1), · · · , r in (1), with the respective probabilities p i1 , p i2 , · · · , p in . The average return in that transition can be formulated as follows: Now, the process continues with the remaining m − 1 transitions. Once the first transition has been made, the system is placed in a state called j, where j takes one of the values 1, 2, 3, · · · , n. The average accumulated return v j (m − 1) in the following transitions is a random variable that can reach the values v 1 (m − 1), v 2 (m − 1), · · · , v n (m − 1), with probabilities of p j1 , p j2 , · · · , p jn , respectively, which remain constant throughout the m transitions, since the process is homogeneous. Therefore, the expected value of the return of the remaining m − 1 transitions can be formulated as: v j (m − 1)) = ∑ n q=1 v j (m − 1) · p jq . We conclude that the expected average return of the system in m transitions is calculated according to Equation (2): The vector V(m) is defined as V(m) = (v 1 (m), v 2 (m), · · · , v n (m)) t , where the superscript t indicates the transpose in the matrix sense. This last equality can be written as a difference equation: where P is the matrix of transition probabilities. The Equation (3) gives the average accumulated return in m transitions from any possible starting state i.

Solving the System of Difference Equations by Applying Transforms Z
Solving this difference equation requires the use of z transform and Laurent's series. To reach its resolution, it is necessary to include the data and functions involved in the calculation: distribution functions of the time spent in each state, the returns for each state, and the matrices that describe the process.
Time-related probabilistic functions for the tree states are as follows: • F(t) is the Cumulative Distribution Function (CDF) of equipment failures, and f (t) is the Probability Density Function (PDF). Based on the discussion in Section 2.3, we will use the three-parametric Weibull distribution. • G(t c ) is the distribution function of the time the equipment remains under corrective maintenance and g(t c ) is its probability density function. • H(t p ) is the distribution function of the time that the equipment remains under preventive maintenance and h(t p ) is its probability density function. Matrices to describe the process are as follows: • P, the transition probability matrix between states (p ij is the probability of going from State i to j). • F, stay time matrix (F ij is the average time in State i before the system goes to State j).
With these data, making the z-transform of V(m + 1) = V(1) + P · V(m), during the calculation, it is required to determine the matrix I − z −1 P to reach Equation (4).
Developing the matrix (I − z −1 P) −1 , the value of Z[V(m)] is reached and decomposed into simple fractions for each of its terms ]. Subsequently, using the appropriate Laurent expansion, the inverse z-transform of each one is calculated, obtaining the equations that describe the average accumulated returns in each transition for each of the starting states of the process, v 1 (m), v 2 (m), and v 3 (m). The full development can be followed in Sánchez Herguedas et al. [62]. The average accumulated return in m transitions starting in the operating state is described by Equation (5).
The average accumulated return in m transitions starting in the corrective state is described by Equation (6). To deduce it, the same reasoning is followed as in the previous case.
The average accumulated return in m transitions starting in the preventive state is described by Equation (7).

Optimisation of the Averaged Accumulated Return When Starting from the Operational State
Deriving and equalising to zero is possible to achieve the mathematical formula of the preventive interval that economically optimises the average accumulated return for each transition, starting from each of the states dv 1 (m) dτ = 0, dv 2 (m) dτ = 0, and dv 3 (m) dτ = 0. The results obtained for each state are: For the case of starting in the operational state, Equation (8).
For cases of starting in the corrective and preventive states, Equation (9).

Results
Applying the procedure followed in Section 2.2 to obtain the Weibull distribution function and Equation (8) to obtain the optimal preventive interval, the results expressed in Table 5 are reached. The average time of corrective and preventive tasks is eight and seven hours, respectively. Case A corresponded to the uncensored data presented in Tables 1 and 2. Case B corresponded to the data with censorship, when the maintenance policy implies a preventive change of the O-rings when reaching 1000 h of operation without failure or when the equipment reaches a total of hours that is a multiple of 1000.
In this case, the Weibull function was altered. The failures were spread over a more extended period, and 38% of them occurred after 3600 h. Obviously, this is not what happened when the data were uncensored, and 62% of the failures appeared before reaching 1300 h. The value of the optimal preventive interval given by the model was very high, 10456. This is due to the flattening and lengthening that the failure distribution curve suffered, as a consequence of the right-censored data. In cases B-2 and B-3, this situation also occurred. Case B-2 considered a failure and the first preventive maintenance at 1000 h. It affected the list of times considered in the Median Rank Regression method. Case B-3 was similar to the previous one; it considered a failure of a preventive maintenance of 1000 h ordered in half of the 87 preventives analysed in Table 4. If the calculations are made only taking into account the failure data and ignoring the censored data at 1000 h, case B-1 is the situation where there is a tendency to group the failures. However, this grouping occurred very early in the life of the O-rings and led to optimising the preventive interval at 353 h. Another group of cases was made up of cases B-4, B-5, and B-6. The distribution of failures was more concentrated, as in case A. However, it was concentrated towards a greater number of hours. For this reason, the optimal preventive interval was established around 2000 h, well above 1059 in case A. All the cases were far from optimal, failing to find a strategy that was close to optimal.
Of course, using this censored data, the optimal preventive interval suitable for a correct economic management of maintenance would not be found. Finding a reason that explains this behaviour is not easy, but it could be thought that the data from A or B cases are not adequate, which could be expected, for example, when two or more failure modes are being mixed.
A further step can be taken in verification. The study can be carried out using the data from case A, Tables 1 and 2. We can take the data from Table 1 and artificially censor the failures greater than 1000 h. This is how Table 6 is generated. We are in a new scenario C, where we are going to show three different cases. In the case of C-3, only failure data less than 1000 h (42 failures) were considered. The case of C-2 employed failure data up to 1000 h, and censorship corresponding to the cases that failed after more than 1000 h. A total of 41 censored data appeared due to the 41 failures produced over 1000 h. This was intended to generate cases from case A. The data used in the C-3 case were those of the first part of Table 6 (darkened part), and the data used in case C-2 were all the data of Table 6. Table 7. Results of the cases (cases A, C-1, C-2, C-3, and D-2) analysed. Parameters of the Weibull distribution function and the optimal preventive interval. Case D-2 corresponded to a new scenario D, where the data from modified Table 1 were used censoring the data in the range of 400 to 2500 h of operation. Case D-2 corresponded to artificial censorship at 900 h, in the same way that case C-2 corresponded to artificial censorship at 1000 h.
The calculations were carried out for two transitions, starting from the operational state v 1 (2). These cases were compared to obtain the economic differences using Equation (5), which corresponded to the average accumulated return. The results are expressed in Table 8. The fifth column represents the return per hour of operation for each case. It would be the same in all even transitions.

Discussion
The results in Table 7 show that, when calculating the optimal preventive interval using the formula of our model, its value was always less than the predetermined preventive interval used (censorship). It is observed that, for a preventive interval adopted of 1000 h, case C-1, the derived optimal preventive interval reached the value of 705 h, while, when the preventive interval used was 900 h, case D-1, the optimal preventive interval reached the value of 656 h.
If with the same data used for cases C-1 and D-1, we modify the value of the predetermined preventive interval used between the range of values 500-2500 h, the behaviour of the optimal preventive interval shows a tendency to increase until reaching the value 899 when the preventive interval used was 2500 h, which was the case for which all fault data were used (see Figure 4). Why would the value 1059 not be reached? This value was not reached because, in these calculations, only the values in Table 2 were considered. This value would have been reached in the calculations if all censored data were considered (Tables 2 and 6). The increase in the optimal preventive interval between the D-1 case of censorship to 900 h (656) and the case of 2500 h (899) was 27%. This value is comparable to the 22.4% value obtained when verifying the increase in the optimal interval between case A (1059) and case A-0 (822) using the data from Tables 2 and 6. In both cases and in other cases with different data, values close to 25% were obtained. This is because the preventive intervals used were 1000 and 900 h, values very close to the optimal 1059 and 899.
This value close to 25% is of particular interest, because it gives a reference value for the possible variation of the preventive interval when the starting data are censored. We have investigated and obtained a preventive interval of 1059 or 899 h, respectively. However, the maintenance manager does not know if their preventive interval used is close to its optimal value (because they do not let their equipment work until failure). If the value of the optimal interval that the maintenance manager obtains with the model is 30% less than the interval used, the maintenance manager would have a preventive maintenance policy with a high interval, which should be reduced. On the contrary, if the value obtained from the model is 20% less than the interval used, the preventive maintenance would have a low interval, which should be increased. This reduction or increase will be greater or lesser depending on how far the value of the optimal interval obtained from the model was from 25% lower than the interval used, Figure 5. This provides the maintenance manager with a forecast on the length of the optimal preventive interval and a decision rule to improve the interval. In Figure 4, an abrupt change in the direction of the curve is observed. It can be seen that for values greater than 700 h (point 700 on the abscissa), the curve experiences a change in trend that is maintained until the end. This indicates that the behaviour of the fault distribution function should be studied when the number of failures is very small compared to a large number of censored data due to preventive maintenance. In this case, from values above 75% of the right-censored value (value of the adopted preventive interval), we can find values with which to obtain good to good results.

Conclusions
To carry out a study where the preventive interval is calculated, it is necessary to have adequate failure data, eliminating from the study those failure data due to other factors (for example, other failure modes). It is usually necessary to have uncensored failure data to optimise the preventive interval, but these are rarely accessible. Most of the data collected by maintenance managers are right-censored due to the performance of predetermined preventive maintenance tasks. The inclusion of censored data in the study introduces uncertainty in calculating the observed function, the theoretical function, the average accumulated return, and the preventive interval. These censored data disturb the calculation of the preventive interval, since they introduce uncertainty. The proposed model allows maintenance managers to use the censored data to guide the modification of the maintenance policy, by optimising the preventive interval, which will have an economic improvement.
The following conclusions can be drawn from using our model's formula to calculate the preventive interval for right-censored data: • If the optimal preventive interval obtained using the model is more than 30% lower than the preventive interval used, the maintenance manager can decrease the interval used. • If the optimal preventive interval obtained is less than 20% less than the preventive interval used, the maintenance manager can increase the interval used. • This increase or decrease in the value of the optimal preventive interval will be more significant or less depending on how far the value obtained from the model is from 25% lower than the interval used.
This decision rule is of particular importance for the maintenance manager, since it allows them to modify the preventive interval in the correct direction. If censored data are used in the study, the optimal preventive interval increases as we also consider failure data of longer duration (adopt longer preventive intervals). These results are equally relevant regardless of the number of data considered when the number of failures is sufficient to define a failure distribution function according to the behaviour of the physical asset.
This research will continue in the future along two different lines, the first one by using other families of estimators and comparing the results, but the second one intends to extend the model by considering a fourth state for the system. The intention is to split the operational status between regular operational status and degraded operational status, where maintenance would be required by either legal or business constraints forcing consideration of its operation and whether in the new state the probabilistic function for failure becomes different. In this way, it would be possible to enable a stronger alignment between production and maintenance policies, where a more integrated perspective for the business as a whole is envisaged.

Acknowledgments:
The authors want to thank the Spanish Agencia Estatal de Investigación for the support provided through the program "Retos para la Sociedad edition 2018".

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: