5.1. System Model Overview
The proposed system scenario represents a possible method of merging random forest, as a data-driven approach to industrial FDD, with model-based FDD approaches into a final hybrid approach that possesses the powerful features of both. This technique eliminates the main drawbacks of each individual approach, such as the lack of dynamicity and of responsiveness to sudden occurrences in traditional model-based FDD. Additionally, it provides validated, accurate and dynamic diagnostic rules that contribute massively to reducing the required diagnostic time and computational resources, compared to their online data-driven counterparts.
Figure 1 shows the two main diagnosis phases used in this research, data-driven and model-based, and how these two methods are combined into a new improved approach.
The following is a comprehensive explanation of each phase:
This phase consists of multiple internal steps essential to learning the best possible dynamic diagnostic rules using a random forest algorithm. Below, each step is discussed in detail.
In this work, the dataset used is a multivariate, time-series dataset of six pressure sensors, four temperature sensors, two volume sensors and one vibration sensor, all of which operate on a constant cycle of 60 s, placed in a hydraulic test rig to monitor its condition over time. For more details about this dataset and its previous applications, please refer to the data collection and generation section of this work.
Complex industrial sensor systems often have hundreds or even thousands of connected sensors, simultaneously transmitting sensor reading data crucial for monitoring and controlling those systems, each of which is considered a feature for analysis and model training. Thus, creating diagnostic models that only include valuable features is a necessity.
Implementing a model with fewer but more meaningful features has a significant impact on the overall system. First, the diagnostic model becomes simpler to analyze and interpret when fewer elements are included. Second, by eliminating some features of the dataset, the data become less scattered, and hence have fewer variants, which can reduce overfitting. Finally, the main reason behind feature selection is to reduce the time and computational costs required to train the model.
In practice, the RF algorithm can be applied to carry out feature selection as well, simply because the features are implicitly ranked by their impurity during the formation of the decision trees that create the forest. In other words, when traversing a tree in the RF top-down, the nodes toward the top happen to have the largest reduction in the impurity metric, also known as Gini impurity, compared to the nodes at the bottom. Thus, by determining a particular impurity decrease threshold, it is possible to prune the tree below this tolerance, in order to establish a subset of the most fitting or important features.
The data-driven FDD method implemented in this work is RF. The intention is to reduce the computational cost as much as possible, so RF is also used to perform feature selection using what is known as feature importance or permutation importance [22,31]: Gini impurity calculations are already measured during the RF training process, and only a small amount of additional computation is required to complete the feature selection process.
In the context of RF, the Mean Decrease Accuracy (MDA), also called permutation importance or feature importance, of a variable $X_j$ in predicting the classes is computed by summing the Gini impurity decreases over all the nodes $n$ where $X_j$ is present and used for splitting, followed by taking the mean of this impurity decrease metric over all the trees $t = 1, \dots, T$ in the forest. The following equation captures the concept of feature importance using RF:

$$I(X_j) = \frac{1}{T} \sum_{t=1}^{T} \sum_{n \in t,\; v(n) = X_j} \Delta i(n) \quad (1)$$

where $X_j$ is the feature of interest, $v(n)$ is the feature/variable used to split node $n$, and $\Delta i(n)$ is the Gini impurity decrease at node $n$.
The most popular implementation of RF feature importance is provided by the Python library Scikit-learn, where a pre-defined attribute, feature_importances_, is read directly from the learned RF model. However, a team of data scientists at the University of San Francisco pointed out some bugs associated with this implementation, and implemented an alternative that generates more accurate feature importance results in [32].
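As a concrete illustration, the ranking and thresholding described above can be sketched with Scikit-learn as follows; the synthetic dataset and the 0.01 importance threshold are illustrative choices, not values from the original experiment.

```python
# Minimal sketch: ranking features by impurity-based importance with
# scikit-learn's RandomForestClassifier. Dataset and threshold are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=11, n_informative=5,
                           random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# feature_importances_ holds the normalized mean impurity decrease per feature
ranked = sorted(enumerate(rf.feature_importances_), key=lambda p: p[1],
                reverse=True)
selected = [idx for idx, imp in ranked if imp > 0.01]  # example threshold
print(selected)
```

The threshold plays the role of the impurity decrease tolerance discussed above: features below it are dropped before retraining.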
The foremost goal of any machine learning algorithm is to minimize the expected loss as much as possible. To achieve this, it is necessary to deploy some optimization equations to select the optimal values for some, or all, of the hyperparameters of the machine learning algorithm of interest.
The RF algorithm has plenty of hyperparameters. On the one hand, some apply at the overall forest level, such as the number of samples randomly drawn from the dataset to form each tree, the choice of sampling with or without replacement, and, most importantly, the number of trees in the RF. On the other hand, some hyperparameters apply at the tree level and control the shape of each tree in the RF, i.e., the number of features drawn for each split, the selection of splitting rules, the depth of each tree, and many others. These parameters are typically selected by the user. Consequently, a method to efficiently select these hyperparameters can significantly influence the performance versus the cost of the RF. In addition, recent research [33] emphasizes the significance of hyperparameter optimization specifically for RF parameters, and provides deep comparisons between numerous tuning and optimization mechanisms and software.
One of the key tuning strategies for RF is using search algorithms to look for optimal parameters in a pool or grid of selected candidates. Search techniques differ in their way of creating the pool or grid, based on the mechanism applied to choose the candidates forming the bag of options. Some search strategies use all the available possibilities as candidates to be exhaustively investigated, one by one, to select the optimal choice, as in a grid search algorithm. In random search, however, the bag of candidates is drawn randomly from the overall existing possibilities, which not only reduces the search complexity, but studies have also shown that random search can produce better accuracy scores for parameter optimization than grid search [34].
Random search refers to a group of search algorithms that rely on randomness or pseudo-random generators as the core of their function. This method is also known as a Monte Carlo, stochastic or metaheuristic algorithm. Random search is beneficial in various global optimization problems, structured or ill-structured, over discrete or continuous variables [35].
Below is the pseudo code describing the workflow of a typical random search algorithm (Algorithm 2).
Algorithm 2 Random Search Algorithm
Let f be the cost function to be optimized or minimized, and let C be a candidate solution in the search-space S.
1: Select a termination condition T, i.e., a specific fitness measure achieved, a maximum number of iterations reached, and so on.
2: Initialize C with a random position in S.
3: Until T is satisfied, repeat:
  (a) Randomly choose a new position C' from the radius surrounding C (the radius of the hypersphere surrounding C).
  (b) If f(C') < f(C), move to the new position by setting C = C'.
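The loop above can be sketched in a few lines of self-contained Python; the cost function, search interval and hypersphere radius below are illustrative assumptions, with a maximum iteration count as the termination condition.

```python
# A minimal sketch of the random search loop in Algorithm 2, minimizing an
# illustrative one-dimensional cost function over a bounded search space.
import random

def f(x):
    # Example cost: squared distance from the optimum at x = 3
    return (x - 3.0) ** 2

def random_search(f, lo=-10.0, hi=10.0, radius=1.0, max_iters=2000, seed=0):
    rng = random.Random(seed)
    c = rng.uniform(lo, hi)                  # step 2: initialize C randomly
    for _ in range(max_iters):               # termination: max iterations
        # (a) draw a candidate from the radius around C, clipped to the space
        cand = min(hi, max(lo, c + rng.uniform(-radius, radius)))
        # (b) move to the candidate only if it lowers the cost
        if f(cand) < f(c):
            c = cand
    return c

best = random_search(f)
print(best)
```

In one dimension the hypersphere around C reduces to an interval of width twice the radius; the same structure extends to vectors of hyperparameters.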
In practice, the Scikit-learn library for Python machine learning provides a method, RandomizedSearchCV, which can be invoked by specifying a range for each hyperparameter subject to optimization. Using RandomizedSearchCV over the predefined ranges, random search is performed to randomly select a candidate grid of possibilities within each range, and then the K-fold cross validation technique is applied over the created grid. For additional examples of this method, refer to [36].
This phase represents a clear model of the system, whether it is an actual physical model, a simulation, a knowledge-base semantically connecting the system components together, or a relational database. Based on the system model's nature, the extracted, nested, conditional rules from the random forest are translated into a suitable form: in knowledge-based systems such as ontologies, the rules are converted into SPARQL semantic queries [37]; into regular SQL queries in case the system model is represented by a relational database; or, in a simpler fashion, directly applied as a small piece of conditional code that can be executed every diagnostic window to perform the diagnosis. This phase is crucial to minimizing the online diagnostic time and computational power needed, compared to invoking the RF testing algorithm over and over for each sliding window. Moreover, it provides the possibility of introducing online distributed mechanisms, given the rules and the graphs creating them.
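The simplest of these forms, direct conditional code executed once per diagnostic window, can be sketched as below. The sensor names, thresholds and fault label are hypothetical placeholders, not values taken from the trained model.

```python
# Illustrative sketch only: one extracted RF rule applied as plain conditional
# code over a single diagnostic window. Thresholds and labels are hypothetical.
def diagnose(window):
    """window: dict of mean sensor readings for one 60 s cycle."""
    if window["PS4"] <= 9.8:
        if window["TS2"] > 52.0:
            return "cooler total failure"
        return "healthy"
    return "healthy"

print(diagnose({"PS4": 9.5, "TS2": 55.1}))  # -> cooler total failure
```

Running such a function per window avoids re-invoking the full RF model online, which is the cost saving described above.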
In this phase, the new time-series data generated by the system over a certain number of sliding windows are stored and used to update the originally created RF, by performing the hyperparameter tuning again to find out whether any alteration of the RF parameters could reduce the size of the overall RF and increase the accuracy at the same time. The decision to accept or reject the new updates is highly dependent on the accuracy of the newly tuned RF.
5.2. Experimental Results
In this experiment, RF is used, following the steps in the data-driven flowchart in Figure 1, to generate dynamic diagnostic rules to diagnose and monitor the health of a hydraulic test rig. In the dataset, each component condition ranges between full efficiency, reduced efficiency and close to total failure. In this experiment, for the sake of simplicity, the healthy state is represented by the fully efficient cycles, and the failures are represented by the cycles where the component is close to failure, while the partial failure state is excluded. Based on the previous fault description, there are four types of total failure, in four different components, to monitor: cooler total failure, valve close to total failure, severe internal pump leakage and hydraulic accumulator close to total failure.
Table 2 explains the definition of each fault chosen for this experiment and some example cycles that contain each fault.
The hydraulic system described in this experiment contains 11 sensor readings from three types of sensors located in different components of the hydraulic test rig: 6 pressure sensors, labelled PS1 to PS6, 4 temperature sensors, labelled TS1 to TS4, and finally 1 vibration sensor, labelled VS1. The readings of all 11 sensors from various cycles, covering the five different statuses shown in the table above, are collected in one labelled dataset of 11 features necessary to perform RF training and analysis.
As mentioned earlier, RF was selected as the classification method in this work after carefully comparing its results with those of other well-known classifiers: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbour (KNN), regular decision tree (CART), Naïve Bayes (NB) and Support Vector Machines (SVM). These supervised machine learning methods, along with RF, are used to perform a multi-class classification task, classifying the hydraulic test rig faults described in Table 2. Table 3 shows the classification results after performing multi-class classification using the different classifiers. CART and RF clearly achieve elevated accuracy compared to the rest of the approaches; however, RF is an ensemble refinement of CART that overcomes its tendency to form overfitted relationships with the training dataset.
The non-zero feature importance method is used to neglect the features with less impact on the learning process, concentrating only on the features that contribute more to the model's accuracy. The table below shows the importance of each sensor to the RF model, calculated using Equation (1).
Table 4 shows the calculated importance of each one of the 11 sensors. There are various ways in which these importance values can be analyzed and evaluated to achieve feature selection. One can pick the highest-importance feature alone to represent all the features, or the highest three, the highest six, or just the non-zero ones to represent the whole pack. However, the most convincing approach is testing all the possibilities and making a logical accuracy versus complexity trade-off. For each selected-features scenario, the RF accuracy and the time complexity, given by $O(T \cdot n \log n)$, are calculated, where $T$ is the number of trees in the RF and $n$ is the size of the input data used for training, assuming that the number of trees, $T$, is constant for all the feature trials. As such, the time complexity is a factor of the input data size, represented by the number of features included, without sacrificing much or any of the model accuracy.
In Table 5, four different RF model training experiments were conducted to find the best number of features required to train an RF of 100 decision trees. In the first trial, the most important feature, PS4, is used alone to train the RF model. The second trial uses the top three most important features: PS4, PS5 and PS6. The third trial applies the highest six features. Finally, only the non-zero-importance features are selected to train the RF model. For all four experiments, the random forest has fixed, randomly chosen hyperparameters: 100 trees and a maximum depth of 5. Furthermore, a 10-fold cross validation technique is used to compute each trial's accuracy.
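The trial procedure can be sketched as a short loop: rank the features once, then evaluate the fixed-hyperparameter RF on each top-k subset with 10-fold cross validation. A synthetic dataset stands in for the hydraulic data here, so the resulting scores are illustrative only.

```python
# Sketch of the feature-count trials: fixed RF hyperparameters (100 trees,
# max depth 5), 10-fold cross validation per feature subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=11, n_informative=6,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
order = np.argsort(rf.fit(X, y).feature_importances_)[::-1]  # rank features

results = {}
for k in (1, 3, 6, 11):                     # top-1, top-3, top-6, all features
    scores = cross_val_score(rf, X[:, order[:k]], y, cv=10)
    results[k] = scores.mean()
print(results)
```

Comparing the mean scores against the per-trial training cost gives exactly the accuracy versus complexity trade-off discussed above.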
For this experiment, the highest six features are used for the training process, since these features provided the best accuracy among all trials and showed lower time and space complexity compared to training with 11 and 8 features, respectively.
Figure 2 shows how the time and space complexities of the RF are directly proportional to the number of features used. It is crucial to emphasize that the amount of accuracy sacrificed, and the tolerated added complexity, depend entirely on the system used and one's preferences; i.e., other researchers might use the highest three features, with 0.977 accuracy, if they are willing to lose more accuracy as the cost of a dramatic drop in both time and space complexities.
The next step is tuning the hyperparameters of the RF applied to the dimensionally reduced dataset of the selected six features: PS4, PS5, PS6, TS2, TS3 and TS4. The hyperparameters subject to tuning in this experiment are the number of trees in the RF, the maximum depth of each tree, the minimum number of samples required to split a node and the minimum number of samples required to form a leaf node. As the purpose behind the RF creation in this research is to establish a set of base rules for fault diagnosis, the main hyperparameter of focus is the number of trees, to lessen the complexity as much as possible, as well as minimizing each tree's depth where possible.
A random grid of hyperparameters is created by applying a random search over a pre-defined range for each parameter separately; i.e., the number of trees is pre-defined to range between 1 and 1000, and only 100 possibilities are selected from this range to form the random grid for this hyperparameter. RF training is then performed using one randomly selected set of hyperparameters at a time. Each selection is validated using three-fold cross validation to calculate the accuracy of the RF model over a particular set of hyperparameters. A set of 100 randomly selected parameter combinations is used to create the grid, which means 300 RF model trainings are executed, considering the three-fold cross validation over the set of 100 possibilities in the grid. Finally, the set of hyperparameters with the highest three-fold cross validation accuracy is selected to generate the diagnostic rules.
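This tuning step can be sketched with Scikit-learn's RandomizedSearchCV as follows. The parameter ranges mirror those described in the text; the data is synthetic, and n_iter is reduced from 100 to 10 here purely to keep the demonstration fast.

```python
# Hedged sketch of the tuning step: random candidates drawn from pre-defined
# ranges, each scored with 3-fold cross validation.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=6, random_state=1)
param_distributions = {
    "n_estimators": randint(1, 1001),       # number of trees: 1 to 1000
    "max_depth": randint(1, 101),
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=1),
                            param_distributions, n_iter=10, cv=3,
                            random_state=1)  # n_iter=100 in the experiment
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

With n_iter=100 and cv=3, this performs exactly the 300 trainings described above, and best_params_ holds the combination selected for rule generation.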
The RF model trained after applying feature selection, with randomly chosen parameters, yielded an accuracy of using 100 decision trees forming the RF, with a maximum depth of five. However, the best hyperparameters, tuned using cross validation over the random search grid, improved the accuracy by to reach . Additionally, using only 49 trees in total instead of 100, over the same depth, dramatically decreased the complexity and size of the generated RF rules, while increasing the accuracy and speed of the diagnosis.
The best hyperparameters selected from the grid are 49 estimators, a minimum sample split of two, a minimum of one sample per leaf and a maximum depth of 83. It is worth mentioning that using all the best hyperparameters increased the accuracy to , but this slight rise is not easily justified, especially when compared to the massive increase in the time and size of each tree in the RF due to the large increase in the maximum depth.
Figure 3 shows one of the decision trees in the RF after feature selection and hyperparameter tuning.
Each tree in the RF can be translated into a set of nested if-else statements of rules. Moreover, the dynamic rules formed from the RF can be fed into various system models to generate a hybrid approach out of the data-driven and the model-based ones. The dynamic rules extracted from the RF can be used as they are, converted into SQL queries if the model is a relational database, or converted into SPARQL queries if the system model is represented by a semantic knowledge-base, such as an ontology.
The diagnostic rules can be extracted from the RF dynamically, using a few lines of Python code. For the sake of simplicity, Figure 4 shows the tree of Figure 3 pruned so that only the positive branch of the root condition remains, connected to a series of nested statements showing how this part of the tree is translated into clear dynamic rules.
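One way such an extraction can be done is with Scikit-learn's export_text helper, which prints one tree of a fitted forest as nested threshold conditions. The data and the mapping of the six sensor names to columns are illustrative assumptions.

```python
# Sketch: turning one tree of a fitted RF into nested if/else rule text using
# scikit-learn's export_text. Data and feature-name mapping are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
names = ["PS4", "PS5", "PS6", "TS2", "TS3", "TS4"]  # the selected sensors

rf = RandomForestClassifier(n_estimators=49, max_depth=5,
                            random_state=0).fit(X, y)
rules = export_text(rf.estimators_[0], feature_names=names)
print(rules)  # nested "|--- PS4 <= ..." conditions, one leaf per rule
```

Each root-to-leaf path in the printed text is one conjunctive diagnostic rule, ready to be rewritten as conditional code or as a query.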
The work in [7] provided a graph-based FDD system for industrial systems using a model-based approach, based on creating a knowledge-base of the system under diagnosis, such as an ontology. A set of static diagnostic rules created by the system expert is then manually fed into the ontology, in a way that forms a causation relationship between the system sensors and the faults they lead to. In this work, we propose creating dynamic rules using RF, extracting these rules and feeding them into the ontology instead of the expert rules, which are static, unreliable and unverifiable. Furthermore, the diagnostic rules extracted using RF can be applied in a variety of forms to fit the model expressing the system.
Figure 5 showcases the rules extracted from the optimized, hyperparameter-tuned RF, and how these rules are transformed into various forms and types to match the nature of the system model. As mentioned before in this chapter, the diagnostic rules can be translated into SQL queries when a relational database is the system model, or into SPARQL queries if a semantic knowledge-base, such as an ontology, is used to represent the system. The rules extracted from each tree may be scheduled separately, or all together with the trees forming the RF.