1. Introduction
Data-driven decisions for complex equipment and its critical components are essential for optimal system performance, thereby meeting the premise of success for asset-intensive industries. This premise can be met through Physical Asset Management (PAM). PAM focuses on a sustainable outcome through equipment productivity—considering, for example, the establishment of an optimal maintenance policy that allows for reaching the goals desired for such equipment, whether these are to maximize availability or minimize costs, or others.
Among the maintenance policies, one of the most interesting is predictive maintenance, which defines the most suitable moment for intervening in a piece of equipment to minimize the probability of failure through diverse techniques. Condition-Based Maintenance (CBM) can be used as a predictive policy to estimate equipment failure rate and reliability based on current and future conditions. This is achieved through constant monitoring of these conditions and the use of tools such as the Proportional Hazards Model (PHM), which assigns a weight to each condition to then calculate the failure risk of the equipment at that moment. Moreover, this information needs to be complemented to decide on intervening in the asset: the failure-risk values that determine when keeping the equipment operating is acceptable and when maintenance should be conducted need to be established; i.e., ranges that define the different possible decisions should be defined.
By combining CBM with PHM, it is possible to monitor vital signs or covariates to predict the failure rate, reliability functions, and maintenance decisions. This analysis requires defining the transition probabilities of asset conditions, which evolve through states over time. When only one covariate is assessed, the model's parameters are commonly derived from expert opinion, which provides the state bands directly. However, the challenge lies in multi-covariate problems, where arbitrary judgment is inappropriate, since the composite measurement does not represent any physical magnitude. Moreover, selecting covariates requires additional procedures to prioritize the most relevant ones.
Currently, there is no default method that accurately and systematically calculates the ranges or status of the equipment for different contexts and conditions. Therefore, in this study, a model is proposed that comprises a method for calculating them using different Machine Learning (ML) algorithms, which use historical information of equipment condition, interventions, and failure. Consequently, the present work aimed to determine multiple covariate bands for the transition probability matrix by supervised classification and unsupervised clustering. We used ML to strengthen the PHM model and complement expert knowledge. Furthermore, this paper allows obtaining the number of covariate bands and the optimal limits of each one when dealing with PHM-CBM predictive maintenance decisions.
In summary, we introduce a new contribution in terms of formulation, analytical properties, and practice for multiple covariate problems, whose bands have a combined measurement unit that often does not represent any physical magnitude. Therefore, to the best of our knowledge, using non-arbitrary criteria for multi-covariate bands has yet to be addressed in depth. Here is when our novel proposal for a CBM-ML condition assessment comes into play, expecting to be of value to asset managers.
This work is divided as follows: In Section 2, the different concepts applied in the model and a literature review are introduced. In Section 3, the proposed model and methodology are presented. Section 4 addresses a case study in which the model is applied, and the results of the case study are discussed. In Section 5, the conclusions of this work are presented.
3. Model Formulation
3.1. Data Pre-Processing
First, treating the data prior to their use is critical, i.e., ensuring the absence of null, missing, or duplicate records. Additionally, the different tables are integrated so that all available information is contained in a single document.
In turn, it is important to identify the types of maintenance and classify them using a binary system for subsequent use in the model, in which one and zero correspond to preventive and corrective maintenance, respectively.
The difference in days between each intervention is obtained, and NaN data are treated, yielding two cases:
This generates a table containing only the data relevant to the development of the model, i.e., dates, an identifier for the assets and their components (if existent), the type of intervention in the binary system, the difference in days between interventions, and a column for each covariate.
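To illustrate this pre-processing step, the following Python sketch (using pandas) shows one possible implementation; the table layout and column names (date, machine_id, component, maint_type, and the covariate columns) are assumptions for illustration and do not correspond to the actual database schema.

```python
import pandas as pd

# Hypothetical raw tables: an interventions log and a covariate-measurement log.
interventions = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-03-20"]),
    "machine_id": [1, 1, 1],
    "component": [2, 2, 2],
    "maint_type": ["preventive", "corrective", "preventive"],
})
covariates = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-03-20"]),
    "machine_id": [1, 1, 1],
    "component": [2, 2, 2],
    "voltage": [220.1, None, 219.5],
    "vibration": [12.3, 12.9, 13.4],
})

# Integrate the tables into a single document and drop duplicates.
df = interventions.merge(covariates, on=["date", "machine_id", "component"], how="left")
df = df.drop_duplicates()

# Encode the maintenance type: 1 = preventive, 0 = corrective.
df["maint_binary"] = (df["maint_type"] == "preventive").astype(int)

# Difference in days between successive interventions of the same component.
df = df.sort_values(["machine_id", "component", "date"])
df["days_between"] = df.groupby(["machine_id", "component"])["date"].diff().dt.days

# Treat NaN values (here: fill covariates with the next recorded value).
df[["voltage", "vibration"]] = df[["voltage", "vibration"]].bfill()
print(df[["date", "machine_id", "component", "maint_binary", "days_between", "voltage", "vibration"]])
```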
3.2. Parameter Estimation
The $\beta$ and $\eta$ parameters represent the stage of the asset's life cycle and its characteristic life, respectively. To estimate them, the reliability of all data should be estimated first. In this case, it is obtained through the Lewis method, in which $\delta_i$ is a binary variable that takes the value 1 when a censoring occurs (e.g., preventive interventions) and 0 when a failure happens. Then, reliability is calculated as a function of time, as proposed by Jardine (1987) [13], who indicated that reliability can be modeled as a two-parameter Weibull distribution through:

$$R(t) = \exp\left[-\left(\frac{t}{\eta}\right)^{\beta}\right]$$

By linearizing the aforementioned expression, the following equation is obtained:

$$\ln\left(-\ln R(t)\right) = \beta \ln t - \beta \ln \eta$$

where $\beta$ is the shape parameter (slope) and $\eta$ is the scale parameter.
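As an illustration of the linearization step, the following sketch fits $\beta$ and $\eta$ through a least-squares line on $\ln(-\ln R(t))$ versus $\ln t$; the median-rank reliability estimate used here is only a simple stand-in for the Lewis method, and the failure times are hypothetical.

```python
import numpy as np

# Hypothetical failure times (days); censored records would be handled by the
# Lewis method in the actual procedure.
t = np.sort(np.array([120.0, 150.0, 200.0, 240.0, 310.0, 380.0, 450.0]))
n = len(t)

# Median-rank approximation of F(t_i) as a stand-in reliability estimate.
i = np.arange(1, n + 1)
R = 1.0 - (i - 0.3) / (n + 0.4)

# Linearized Weibull: ln(-ln R(t)) = beta*ln(t) - beta*ln(eta)
x = np.log(t)
y = np.log(-np.log(R))
beta, intercept = np.polyfit(x, y, 1)
eta = np.exp(-intercept / beta)

print(f"beta (shape) ~ {beta:.2f}, eta (scale) ~ {eta:.1f} days")
# Sanity check: R(eta) should be close to exp(-1), i.e., approximately 37%.
print("R(eta) =", np.exp(-(eta / eta) ** beta))
```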
3.3. Data Division and Standardization
The next step corresponds to the division of the database into two groups: a training group and a testing group. This division is necessary mainly because machine learning algorithms require training before delivering final results. The training group is used for this purpose, and the testing group is employed to validate the results obtained from these algorithms.
There is no single rule for how much data should be left in each group, but dividing the whole database into 3/4 for the training group and the remaining data for the testing group is customary.
In turn, before working with the covariate data, it is recommended to standardize them. This implies rescaling the data so that they have the same order of magnitude, as well as the same minimum and maximum values. This facilitates the analysis of the weights of each covariate, as they can be compared directly, allowing their relative importance to be established.
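A minimal scikit-learn sketch of this step is shown below, assuming a 3/4 to 1/4 split and a min-max rescaling; the specific feature range (0.01 to 0.1, as in the case study) and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(763, 4))   # hypothetical covariate matrix (voltage, rotation, pressure, vibration)
y = rng.random(763)             # hypothetical target (e.g., failure rate per sample)

# 3/4 training, 1/4 testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Rescale the covariates to a common range (fit on training data only).
scaler = MinMaxScaler(feature_range=(0.01, 0.1))
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
print(X_train_std.min(axis=0), X_train_std.max(axis=0))
```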
3.4. Covariate Weight Calculation
Before clustering the covariates $z$, their corresponding weights should be calculated with respect to the failure rate associated with each sample of the database. To calculate them, this work proposes using a random forest algorithm, which receives the values of each covariate and their corresponding failure rates (obtained in the previous step) as input parameters. With this, the algorithm delivers the relative importance of each $z$ with respect to the failure rate, and these importances are used as the respective weights.
It should be noted that, since this algorithm offers several hyper-parameters to set prior to its use, it is recommended to test different combinations of them and determine which yields the best results for the database. This can be performed automatically with hyper-parameter search utilities available in machine learning libraries.
It is noteworthy that the calculation of the weights of the different covariates is not the focus of this work; this procedure is solely used to obtain values related to the database under use and to move to the following steps of the proposed model. The validity of the weights obtained through this algorithm has not been proven; thus, if they need to be calculated rigorously, other proven alternatives are recommended.
Once the weights are obtained, a new parameter $Z$ is introduced into the database, which will be used for the cluster division. This corresponds to the sum of each covariate multiplied by its weight; i.e.:

$$Z = \sum_{i} w_i z_i$$
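The following sketch illustrates how the weights and the combined parameter $Z$ could be computed with scikit-learn's random forest feature importances; the synthetic covariates, target, and hyper-parameters are assumptions, and an automated hyper-parameter search (e.g., scikit-learn's GridSearchCV) could replace the fixed settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
covariate_names = ["voltage", "rotation", "pressure", "vibration"]
X = rng.uniform(0.01, 0.1, size=(572, 4))                              # standardized covariates (hypothetical)
hazard = X @ np.array([0.5, 1.5, 0.8, 2.0]) + rng.normal(0, 0.01, 572)  # hypothetical failure-rate target

# The random forest gives the relative importance of each covariate w.r.t. the failure rate.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X, hazard)
weights = rf.feature_importances_            # used as the covariate weights w_i
print(dict(zip(covariate_names, weights.round(3))))

# Combined condition indicator: Z = sum_i w_i * z_i for every sample.
Z = X @ weights
print(Z[:5])
```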
3.5. Clustering
To use the clustering algorithms, data simplification is recommended; in this case, $Z$ is subdivided into class intervals, and each interval is assigned its corresponding class mark, which reduces the quantity of data the algorithm has to work with. To determine the number of intervals into which the database is divided, Sturges' rule is employed, which defines the number of intervals as a function of the number of samples ($k = 1 + \log_2 n$).
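A small sketch of this simplification is given below, assuming Sturges' rule $k = 1 + \log_2 n$ and replacing each $Z$ value by the mark (midpoint) of its class interval; the $Z$ values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.gamma(shape=2.0, scale=0.02, size=572)   # hypothetical combined covariate values

# Sturges' rule for the number of class intervals.
k = int(np.ceil(1 + np.log2(len(Z))))            # 11 intervals for a few hundred samples

# Bin Z and replace each value by its class mark (interval midpoint).
edges = np.linspace(Z.min(), Z.max(), k + 1)
idx = np.clip(np.digitize(Z, edges) - 1, 0, k - 1)
class_marks = (edges[:-1] + edges[1:]) / 2
Z_simplified = class_marks[idx]
print(k, np.unique(Z_simplified).size)
```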
Afterwards, this work proposes two different clustering algorithms, one based on k-means and the other on the Gaussian mixture model (GMM). The main hyper-parameter that both methods require is the number of clusters into which the data are to be divided. To determine this number, different values can be tested and the results compared, as conducted with the random forest. Among the aspects to consider for this choice are the indices yielded by the algorithms, such as the mean silhouette index or the information criteria (AIC and BIC), which allow comparing the results to decide which number of clusters best adapts to the database. In turn, it should be kept in mind that selecting simpler models is always recommended; i.e., if the difference in fit between two numbers of clusters is small, the smaller number of clusters is preferred. What these clusters will represent in the analysis to be performed should also be considered, as it may provide clues about the most realistic number of clusters for a specific case. The number of class intervals into which the data were divided at the beginning should not be overlooked either, because if the data are divided into the same number of clusters as intervals, the results of such a division may be biased and not necessarily be the best option.
Finally, once the number of clusters is selected and the database is divided via the algorithms mentioned above by means of their corresponding parameters, the last step is to classify/mark each datum in the cluster to which it belongs in order to facilitate the observation of the algorithm’s result.
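The sketch below illustrates how the number of clusters might be compared using the mean silhouette score for k-means and the AIC/BIC for the GMM, followed by the final labelling; the data and settings are illustrative assumptions, not the configuration used in the case study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
# Hypothetical simplified Z values (class marks), reshaped to 2-D for scikit-learn.
Z = np.concatenate([rng.normal(0.03, 0.005, 500),
                    rng.normal(0.07, 0.005, 50),
                    rng.normal(0.11, 0.005, 22)])
Zc = Z.reshape(-1, 1)

# Compare candidate numbers of clusters with silhouette (k-means) and AIC/BIC (GMM).
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Zc)
    gmm = GaussianMixture(n_components=k, random_state=0).fit(Zc)
    sil = silhouette_score(Zc, km.labels_)
    print(f"k={k}: silhouette={sil:.3f}, GMM BIC={gmm.bic(Zc):.1f}, AIC={gmm.aic(Zc):.1f}")

# Final labelling with the chosen number of clusters (assumed here to be 3).
labels = GaussianMixture(n_components=3, random_state=0).fit_predict(Zc)
```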
3.6. Covariate Bands Calculation
The covariate-band calculation is performed using the following steps (a combined sketch is given after the list):
Through distance to centroids: Based on the clustering by GMM and k-means, the centroids of each cluster are calculated; then, the average between successive centroids (i.e., between 1 and 2, between 2 and 3, etc.) is calculated, and these values are defined as the band limits. In this way, the limit values indicate the cluster in which each $Z$ value will be found.
Through probabilities: Using the clusters found, the probability that each observation belongs to each cluster is calculated to classify the observations. Then, the minimum $Z$ value belonging to each cluster is identified, and these minima delimit the ranges.
Cluster border classification methodology: When a $Z$ value is close to one of the classification borders and it is not easy to elucidate which cluster it belongs to, a solution using the probabilities given by the GMM is proposed. If the difference between the membership probabilities of the two adjacent clusters is smaller than 5%, the value is assigned at random according to those probabilities, i.e., according to how likely it is that the value belongs to one cluster or the other.
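A combined sketch of the two band-limit calculations and the 5% border rule is given below, assuming a one-dimensional $Z$ and a fitted GMM as above; it is illustrative rather than the exact implementation used in this work.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
Z = np.concatenate([rng.normal(0.03, 0.005, 500),
                    rng.normal(0.07, 0.005, 50),
                    rng.normal(0.11, 0.005, 22)])
Zc = Z.reshape(-1, 1)
gmm = GaussianMixture(n_components=3, random_state=0).fit(Zc)

# (1) Limits via centroids: midpoints between successive (sorted) cluster means.
centroids = np.sort(gmm.means_.ravel())
limits_centroid = (centroids[:-1] + centroids[1:]) / 2

# (2) Limits via probabilities: the smallest Z value assigned to each cluster
#     delimits the band of that cluster.
labels = gmm.predict(Zc)
order = np.argsort(gmm.means_.ravel())            # clusters ordered by increasing mean
limits_prob = [Z[labels == c].min() for c in order[1:]]

# (3) Border rule: if the two largest membership probabilities differ by less
#     than 5%, assign the point at random according to those probabilities.
proba = gmm.predict_proba(Zc)
top2 = np.sort(proba, axis=1)[:, -2:]
ambiguous = (top2[:, 1] - top2[:, 0]) < 0.05
final = labels.copy()
for i in np.where(ambiguous)[0]:
    final[i] = rng.choice(len(gmm.means_), p=proba[i])

print("centroid limits:", np.round(limits_centroid, 4))
print("probability limits:", np.round(limits_prob, 4))
print("ambiguous points reassigned:", int(ambiguous.sum()))
```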
3.7. Transition Probability Matrix
Once the ranges are calculated, the next step is to generate a transition probability matrix. To this end, the state to which each database sample belongs should be identified. Then, ordering the database chronologically, the state transitions are identified, specifically how many times a state $i$ transited to a state $j$, for all possible state combinations. This value is identified with the parameter $N_{ij}$. Subsequently, the following parameter to be calculated is $T_i$, which represents the time during which the equipment remains in state $i$. With these two parameters, the transition rates are calculated as:

$$\lambda_{ij} = \frac{N_{ij}}{T_i}$$
Finally, the transition probabilities over a time interval are calculated from these rates.
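To make the counting step concrete, the sketch below builds $N_{ij}$ and $T_i$ from a hypothetical chronologically ordered state sequence and converts the rates into one-step transition probabilities with the simple discretization $p_{ij} \approx \lambda_{ij}\Delta$ for $i \neq j$; both the data and this discretization are assumptions, and the exact expression used in practice may differ.

```python
import numpy as np

# Hypothetical, chronologically ordered state sequence with the time (days)
# spent in each state before the next observation.
states = np.array([1, 1, 2, 2, 2, 1, 2, 3, 2, 2])       # state of each sample (1..3)
dwell = np.array([10, 12, 8, 9, 11, 10, 7, 5, 9, 10])   # days spent in that state

n_states = 3
N = np.zeros((n_states, n_states))   # N[i, j]: number of transitions i -> j
T = np.zeros(n_states)               # T[i]: total time spent in state i
for k in range(len(states) - 1):
    i, j = states[k] - 1, states[k + 1] - 1
    N[i, j] += 1
    T[i] += dwell[k]
T[states[-1] - 1] += dwell[-1]

# Transition rates lambda_ij = N_ij / T_i (off-diagonal entries).
lam = np.divide(N, T[:, None], out=np.zeros_like(N), where=T[:, None] > 0)
np.fill_diagonal(lam, 0.0)

# One-step transition probabilities over an interval delta (assumed discretization).
delta = 1.0
P = lam * delta
np.fill_diagonal(P, 1.0 - P.sum(axis=1))
print(np.round(P, 3))
```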
3.8. Reliability Calculation
At this step, the quantity of iterations to conduct is first established, which requires an initial value for time and a value $\Delta$ for the span between successive time values; the lower this value, the more precise the result. The number of iterations $n$ performed is then the total time span divided by $\Delta$.
Subsequently, the exponent value of each covariate should be obtained as a function of $w_i$, the weight of each covariate, and $\Delta$, the approximation interval length. The values obtained are expressed in a diagonal matrix and then multiplied by the transition probability matrix, which should also be expressed as a diagonal matrix, thereby obtaining matrix $L$.
The following step is to obtain the product between the matrix $L$ of the previous iteration and that of the current iteration. The conditional reliability is then obtained as the sum of each row of the resulting matrix. This is conducted using the product-property method explained in detail in [15], for which the failure rate is first estimated through the Weibull-PHM expression:

$$h(t, Z) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1} e^{Z}$$

Using the above together with the transition probability matrix, the failure rate matrix and $L$ are finally solved.
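The following schematic sketch mirrors the iteration described above: a diagonal survival factor per state is multiplied by the transition probability matrix, and the row sums of the cumulative product give the conditional reliability. The per-state Weibull-PHM hazard $h_i(t) = (\beta/\eta)(t/\eta)^{\beta-1}e^{Z_i}$ and all numerical values are assumptions; this is a simplified stand-in for the full product-property method of [15].

```python
import numpy as np

beta, eta = 2.0, 300.0                   # Weibull parameters (hypothetical)
Z_state = np.array([0.02, 0.06, 0.10])   # representative Z value per state (hypothetical)
P = np.array([[0.97, 0.03, 0.00],        # transition probability matrix per interval (hypothetical)
              [0.01, 0.97, 0.02],
              [0.00, 0.05, 0.95]])
delta, horizon = 1.0, 360                # interval length and number of iterations

def hazard(t, z):
    """Weibull-PHM hazard for working age t and combined covariate value z (assumed form)."""
    return (beta / eta) * (t / eta) ** (beta - 1) * np.exp(z)

L_prod = np.eye(3)
reliability = []                                         # conditional reliability per starting state
for n in range(1, horizon + 1):
    t = n * delta
    D = np.diag(np.exp(-hazard(t, Z_state) * delta))     # survival factor in each state over delta
    L = D @ P                                            # matrix L of the current iteration
    L_prod = L_prod @ L                                  # product with the previous iterations
    reliability.append(L_prod.sum(axis=1))               # row sums = conditional reliability

reliability = np.array(reliability)
print(reliability[[0, 99, 359]].round(3))                # R(t | starting state) at t = 1, 100, 360
```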
4. Case Study and Discussion
In this case study, the methodology presented in the previous section was applied to a sample of 100 machines, each of them with four components. The critical components of interest can be regarded as the electric motor stators of a fleet of haul trucks operating at a mine site. The voltage, rotation, pressure, and vibration of each component were measured, these being the covariates examined.
Regarding data pre-processing, the absence of null and duplicate data was first confirmed. Then, the data were classified using a binary system into preventive maintenance (1) or corrective maintenance (0). Afterwards, NaN values were filled using the next recorded value, since, in this case, using the data average could affect the results due to the data behavior.
Table 1 shows an extract of the 100-machine sample.
In Table 1, it is possible to appreciate that the different columns correspond to various data related to the main groups: time and covariates. The first is the date on which the measurement of the covariates was carried out. Next, the Machine ID identifies the machine to which the analyzed component corresponds. Then, the type of component is given and, subsequently, the type of intervention, whether corrective or preventive. The time elapsed between interventions allows for estimating how long the component has been available, and, finally, the individual measurements of the covariates are listed. Voltage was measured in V, rotation in rpm, pressure in Pa, and vibration in Hz. It is important to remember that although each covariate must work within a well-defined interval, the combined effect of all of them works as an indicator to define the status of the component and, consequently, of the corresponding machine. This last effect leads to a significant purpose of this manuscript: the proposal of an ML condition assessment to complement the expert criterion when dealing with multiple covariates and diverse measurement units.
In turn, to facilitate the use of data, a new table was created for each component to address them separately. In this case, only component 2 was used because it had the greatest amount of data, and the development for the other components is analogous. Subsequently, as mentioned in Section 3.2, the parameters $\beta$ and $\eta$ were estimated, obtaining the values presented in Table 2.
To confirm that such values are correct, reliability was calculated using Equation (2), where $t$ is now a test time interval; in this case, time intervals increasing in steps of 10 units were used. The value of $\eta$ should match the time interval in which reliability is approximately 37%.
The next step after confirming the results was data division and standardization. In this case, as mentioned during the formulation, the database was divided into a training group composed of 572 samples and a testing group with 191 data points. Then, the covariates were standardized in a 0.01–0.1 range, selecting these values to avoid zero and negative numbers, as these could alter the results. Subsequently, the weights of the covariates (rotation, pressure, voltage, and vibration) were calculated with respect to the failure rate using the random forest algorithm, obtaining the $Z$ parameters through Equation (4).
In addition, the data were divided into class intervals by means of Sturges' rule, which yielded 11 intervals. After the above procedure, the two clustering methods, k-means and GMM, were employed to define the number of clusters that best adapts to the database. In both cases, the result is 3, which was obtained from the results shown in Figure 1 and Figure 2.
The figures above show that the results of both methods are similar for this database. In this way, the limits of each cluster were generated, which in turn define the states of the component. Different methods are proposed to achieve this.
The first one is through the centroids of each cluster. The limits between two clusters are defined as the middle points between both centroids, which are presented in Figure 3 and Figure 4 for k-means and GMM, respectively.
From this point, only the GMM method was employed.
Table 3 specifies the numerical limits for the training data, and Table 4 presents the results used for the testing data.
The second method for calculating the ranges of each state was based on the probabilities delivered by the GMM method. These can be observed graphically in Figure 5.
In this case, the limits are defined as the lowest point of the probability of belonging to a cluster, which is depicted as the darker areas in Figure 6. In Figure 7, the two lowest points that define these limits are also clearly observed.
Regarding the results for range calculation, both methods reached similar yet not equal results. This work does not focus on analyzing the advantages or disadvantages of one method over another. From this point, the calculations continued based on the range results obtained by probability.
When classifying each sample into its corresponding state, 557 data points were obtained that belonged to state 2 (central cluster), 12 to state 3 (upper cluster), and 3 to state 1 (lower cluster).
As an alternative result, it is considered that the probabilities of belonging to one cluster or another are less clear for points close to the limit between clusters; therefore, a reliability range around this limit is proposed such that, if the difference in the probabilities of belonging between clusters is smaller than 5%, a point within this range is randomly classified into one of the adjacent clusters according to those probabilities. In this way, the quantity of data in each cluster becomes:
state 1 = 3 data points
state 2 = 550 data points
state 3 = 19 data points
Subsequently, the probability transition matrices, shown in Table 5 and Table 6 for the training and testing data, respectively, were calculated using the results obtained with this last method.
Finally, the conditional reliability function is estimated through the steps described in Section 3.8. In this case, it is established that the time horizon is 360 and the interval length $\Delta$ is 1. Then, reliability with respect to time was obtained for both training and testing data.
Figure 8 and Figure 9 present the corresponding conditional reliability functions using the probability transition matrices as input to the aforementioned product-property method.
These figures show that, for both cases, the reliability behavior for each state was similar. Specifically, it can be determined that state 3 corresponds to the state in which the component exhibited the highest reliability, followed by state 2 and then state 1, which presented the lowest reliability over time. This agrees with the quantity of data present in each state.
This data analysis consisted of 100 machines with four components each and was aimed at defining the limit ranges of multiple covariates. The results obtained after pre-processing the data (to treat NaN, null, or duplicate records) were the estimates of the parameters $\beta$ and $\eta$ using the Lewis and Jardine methods. Since the quantity of data associated with component 2 was larger, this was the selected component (considering that each piece of equipment has this component), obtaining the $\beta$ and $\eta$ (days) values presented in Table 2.
Once these parameters were obtained, the data were divided, allocating 3/4 of them (572 data points) to the training of the machine learning algorithms and 1/4 (191 data points) to validation using those algorithms. In addition, the data points were rescaled so all of them had the same order of magnitude, thereby facilitating further analyses. Subsequently, the weight associated with each covariate (in this case, vibration, rotation, pressure, and voltage) was calculated via the random forest algorithm, yielding their relative importance with respect to the failure rate, to then calculate $Z$, which corresponds to the sum of each covariate multiplied by its associated weight.
Having defined the above, two clustering algorithms (k-means and GMM) were used to calculate the number of clusters that best fits the database, which is three in both cases. To find the limits of these groups, the ranges associated with each were defined through two methods. First, the ranges were calculated based on the average between the centroids of two successive clusters, yielding the limit values separating clusters 1, 2, and 3 reported in Table 3 and Table 4. In turn, the ranges were defined based on the probabilities generated by the GMM method, taking the lowest point of the probability of belonging to a cluster. This method was selected because its ranges do not differ significantly from the first one, while allowing better observation of the clusters obtained. The results indicate that 3 data points belonged to cluster 1, 557 to cluster 2, and 12 to cluster 3. In connection, the classification procedure for values close to the cluster limits was also applied. As a result, 3 data points were classified into cluster 1, 550 into cluster 2, and 19 into cluster 3. With this information, the transition matrix was obtained for both training and testing data. Afterwards, the conditional reliability function was calculated for each cluster, indicating that cluster three had the highest reliability and cluster one the lowest over time. In this way, the objective of this study was accomplished, as the cluster bands for all covariates involved in the data used were defined in a coherent and reliable way.
For predictive purposes and using the estimated conditional reliability functions as inputs, Figure 10 and Figure 11 show the remaining useful life (RUL) for the training and testing data, considering the evolution of the clustered covariate data throughout the states. It is observed that Cluster-State 3 can be regarded as the best condition, since its decay from the maximum RUL presents smooth behavior. On the other hand, State 1 is the worst clustered condition, since its decline was the most aggressive over the working age. Although this effect could be expected, the novelty of the present proposal remains as aforementioned. Namely, expert knowledge input can be straightforward when separating band limits for one covariate, since it has only one measurement unit. Nevertheless, it is difficult for an expert to deliver a band-limit value when dealing with diverse covariates, especially when the combined measurement unit does not represent a physical magnitude that can be evaluated directly. Hence, the proposed ML method complements expert knowledge through a novel condition assessment.
Now that the effect of the condition has been explained, the other component of the predictive PHM model (Equation (10)) is the contribution of age to the hazard rate. The values of $\beta$ and $\eta$ are all consistent with their interpretation in the conditional reliability and RUL of Figure 8, Figure 9, Figure 10 and Figure 11. All the $\beta$ values in Table 2 are higher than 1, indicating the wear-out stage of the asset life cycle; it means that the components under study were aging throughout the period. The characteristic life $\eta$ agrees with the values of the working age as well. However, it is interesting to note that in the earlier stage of the life cycle, the impact of the clustered condition was far more significant than the effect of age; that is, the difference in RUL across the states was more noticeable. On the other hand, due to the PHM model, the RUL differences became shorter at the later stage of the asset life. This means that, at elevated working ages, the effect of the condition is outweighed by the effect of time, because the overall condition of the asset is already highly degraded when reaching an advanced operating span.
Finally, the proposed ML condition assessment adds innovative knowledge about the clustered covariate process. Namely, it handles several vital signs with diverse sources and measurement units, thereby enabling a holistic approach. When implementing a predictive policy, this novel condition assessment allows a better understanding of the asset's operational health in order to intervene at the right moment (in contrast to a fixed-age replacement policy), thereby maximizing the assets' RULs. It is demonstrated that this novel ML approach to maintenance policies can provide significant results as an alternative or complement to the expert criterion, offering advantages especially when dealing with more than one covariate, and ultimately improving the estimation and precision of the remaining useful life of critical assets.