5.1. System Model Overview
The proposed system scenario merges random forest, as a data-driven approach to industrial FDD, with model-based FDD approaches into a final hybrid approach that possesses the powerful features of both. This technique eliminates the main drawbacks of each approach taken individually, such as the lack of dynamicity and responsiveness to sudden occurrences in traditional model-based FDD. It also provides validated, accurate and dynamic diagnostic rules that contribute massively to reducing the required diagnostic time and computational resources, compared to their online data-driven counterparts.
Figure 1 shows the two main diagnosis phases used in this research, data-driven and model-based, and how these two methods are combined into a new, improved approach.
The following is a comprehensive explanation of each phase:
This phase consists of multiple internal steps essential to learning the best possible dynamic diagnostic rules using a random forest algorithm. Below, each step is discussed in detail.
In this work, the dataset used is a multivariate, time-series dataset from six pressure sensors, four temperature sensors, two volume sensors and one vibration sensor, all sampled over a constant cycle of 60 s, placed in a hydraulic test rig to monitor its condition over time. For more details about this dataset and its previous applications, please refer to the data collection and generation section of this work.
Complex industrial sensor systems often have hundreds or even thousands of connected sensors, simultaneously transmitting readings crucial to monitoring and controlling those systems, each of which is considered a feature for analysis and model training. Thus, creating diagnostic models that include only valuable features is a necessity.
Implementing a model with fewer but more meaningful features has a significant impact on the overall system. First, the diagnostic model becomes simpler to analyze and interpret when fewer elements are included. Second, by eliminating some features of the dataset, the data become less scattered, hence lower in variance, which can reduce overfitting. Finally, the main reason behind feature selection is to reduce the time and computational costs required to train the model.
In practice, the RF algorithm can be applied to carry out feature selection as well, simply because the features are implicitly ranked by their impurity during the formation of each decision tree that forms the forest. In other words, when traversing a tree in RF top-down, the nodes toward the top happen to have the largest reductions in the impurity metric, also known as Gini impurity, compared to the nodes at the bottom. Thus, by determining a particular impurity-decrease threshold, it is possible to prune the tree below this tolerance, in order to establish a subset of the most fitting or important features.
The data-driven FDD method implemented in this work is RF. The intention is to reduce the computational cost as much as possible, and RF is also used to perform feature selection using what is known as feature importance or permutation importance [22,31], since Gini impurity is already measured during the RF training process, and only a small amount of additional computation is required to complete the feature selection.
In the context of RF, the Mean Decrease Accuracy (MDA), also known as permutation importance or feature importance, of a variable $X_n$ in predicting the classes $Y$ is computed by summing the Gini impurity decreases over all the nodes $d$ where $X_n$ is present and used for splitting, followed by taking the mean of this impurity decrease over all the trees $D$ in the forest. The following equation captures the concept of feature importance using RF:

$\mathrm{Imp}(X_n) = \frac{1}{D} \sum_{t=1}^{D} \; \sum_{d \in t,\, v(s_d) = X_n} \Delta i(s_d)$  (1)

where $X_n$ is the feature of interest, $\Delta i(s_d)$ is the impurity decrease produced by split $s_d$ at node $d$, and $v(s_d)$ is the feature/variable used to split $s_d$.
The most popular implementation of feature importance for RF is in the Python library Scikit-learn, where a predefined attribute, feature_importances_, can be read directly from the learned RF model. However, a team of data scientists at the University of San Francisco pointed out some biases associated with this implementation, and built an alternative that generates more accurate feature importance results in [32].
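As an illustration, both importance measures can be sketched with Scikit-learn in a few lines; the synthetic data below is a stand-in for the hydraulic-rig dataset, and the feature layout (only one informative feature) is an assumption made purely for the example:

```python
# Minimal sketch: impurity-based vs. permutation feature importance in
# scikit-learn, on synthetic stand-in data (NOT the hydraulic-rig dataset).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))        # four mock "sensor" features
y = (X[:, 2] > 0).astype(int)        # only feature 2 drives the label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances come for free from training (an attribute,
# not a method call):
print(rf.feature_importances_)

# Permutation importance, measured on held-out data, is the more robust
# alternative discussed in the text:
pi = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(pi.importances_mean)
```

Both measures should rank the single informative feature first; the impurity-based values additionally sum to one over all features.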
The foremost goal of any machine learning algorithm is to minimize the expected loss as much as possible. To achieve this, it is necessary to deploy some optimization technique to select the optimal values for some, or all, of the hyperparameters of the machine learning algorithm of interest.
The RF algorithm has plenty of hyperparameters. On one hand, some operate at the overall forest level, such as the number of samples randomly drawn from the dataset to form each tree, the choice of sampling with or without replacement, and, most importantly, the number of trees in the RF. On the other hand, some hyperparameters are at the tree level and control the shape of each tree in the RF, i.e., the number of features drawn for each split, the selection of splitting rules, the depth of each tree, and many others. These parameters are typically selected by the user. Consequently, a method to efficiently select these hyperparameters can significantly influence the performance versus the cost of RF. In addition, recent research in [33] emphasizes the significance of hyperparameter optimization specifically for RF parameters, as well as providing deep comparisons between numerous tuning and optimization mechanisms and software.
One of the key tuning strategies for RF is using search algorithms to look for optimal parameters in a pool or grid of selected candidates. Search techniques differ in the way the pool or grid is created, based on the mechanism applied to choose the candidates forming the bag of options. Some search strategies exhaustively investigate all available possibilities, one by one, to select the optimal choice, as in a grid search algorithm. In random search, however, the bag of candidates is drawn randomly from the overall existing possibilities, which not only reduces the search complexity; studies have also shown that random search produces better accuracy scores for parameter optimization than grid search [34].
Random search refers to a group of search algorithms that rely on randomness or pseudorandom generators as the core of their function. This class of method is also known as a Monte Carlo, stochastic or metaheuristic algorithm. Random search is beneficial in various global optimization problems, structured or ill-structured, over discrete or continuous variables [35].
Algorithm 2 is the pseudocode describing the workflow of a typical random search algorithm.
Algorithm 2 Random Search Algorithm
Let $RF$ be the cost function to be optimized or minimized, and $C$ a candidate solution in the search space $R^n$.
1: Select a termination condition $TC$, i.e., a specific fitness measure achieved, a maximum number of iterations reached, and so on.
2: Initialize $C$ to a random position in $R^n$: $C = \text{random position} \in R^n$.
3: Until $TC$ is met:
  (a) Randomly choose another position $C_{new}$ from the radius surrounding $C$ (the radius of the hypersphere surrounding $C$).
  (b) If $RF(C_{new}) < RF(C)$, then $C = C_{new}$.
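The steps above can be sketched in a few lines of self-contained Python; the quadratic toy objective, the bounds and the step radius are assumptions made for the example (and the neighborhood is sampled from a cube standing in for the hypersphere), not part of the original algorithm statement:

```python
# Minimal sketch of Algorithm 2 (random search) on a toy cost function.
import random

def random_search(cost, dim, radius=0.5, max_iters=2000, seed=0):
    rnd = random.Random(seed)
    # Step 2: initialize C to a random position in R^n.
    c = [rnd.uniform(-5.0, 5.0) for _ in range(dim)]
    best = cost(c)
    # Step 3: loop until the termination condition (here: max iterations).
    for _ in range(max_iters):
        # (a) sample a new position from the neighborhood around C
        #     (a cube of half-width `radius`, approximating the hypersphere).
        c_new = [x + rnd.uniform(-radius, radius) for x in c]
        # (b) keep the new position only if it lowers the cost.
        if cost(c_new) < best:
            c, best = c_new, cost(c_new)
    return c, best

# Toy objective with its minimum at the origin.
pos, val = random_search(lambda v: sum(x * x for x in v), dim=3)
print(round(val, 4))
```

With enough iterations the candidate drifts toward the minimum, since only cost-reducing moves are accepted.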
In practice, the Scikit-learn library for Python machine learning provides a method, RandomizedSearchCV, which can be invoked by defining a range for each hyperparameter subject to optimization. RandomizedSearchCV performs a random search over the predefined ranges to select a candidate grid of possibilities, then applies the K-fold cross validation technique over the created grid. For additional examples of this method, refer to [36].
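A minimal usage sketch of RandomizedSearchCV over the four RF hyperparameters tuned later in this work follows; the parameter ranges, the number of candidates and the synthetic data are illustrative assumptions, not the actual grid used in the experiment:

```python
# Sketch: random search with three-fold cross validation over RF
# hyperparameters, on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=0)

param_distributions = {
    "n_estimators": list(range(1, 201)),       # number of trees
    "max_depth": list(range(2, 20)),           # depth of each tree
    "min_samples_split": list(range(2, 10)),   # samples needed to split a node
    "min_samples_leaf": list(range(1, 5)),     # samples needed to form a leaf
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20,        # 20 random candidates (the text samples 100)
    cv=3,             # three-fold cross validation, as in the text
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The best candidate and its cross-validated accuracy are then read from `best_params_` and `best_score_`.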
This phase represents a clear model of the system, whether an actual physical model, a simulation, a knowledge base semantically connecting the system components together, or a relational database. Based on the nature of the system model, the nested, conditional rules extracted from the random forest are translated into a suitable form: in knowledge-based systems such as ontologies, the rules are converted into SPARQL semantic queries [37]; into regular SQL queries when the system model is a relational database; or, more simply, the extracted rules are applied directly as a small piece of conditional code that can be executed every diagnostic window to perform the diagnosis. This phase is crucial to minimizing the online diagnostic time and computational power needed, compared to invoking the RF testing algorithm over and over for each sliding window. Moreover, it introduces the possibility of online distributed mechanisms given the rules and the graphs creating them.
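To make the rule-translation idea concrete, the sketch below turns one extracted tree-path rule into an SQL query; the table name `cycles`, the column names and the thresholds are all hypothetical, invented only for this illustration:

```python
# Illustrative sketch: translating one extracted diagnostic rule into SQL,
# assuming a hypothetical table `cycles` whose columns are named after the
# sensors. The thresholds are made up for the example.
def rule_to_sql(conditions, fault, table="cycles"):
    """conditions: list of (column, operator, threshold) triples."""
    where = " AND ".join(f"{col} {op} {thr}" for col, op, thr in conditions)
    return f"SELECT cycle_id, '{fault}' AS diagnosis FROM {table} WHERE {where};"

# A hypothetical rule of the kind a single tree path produces:
sql = rule_to_sql([("PS4", "<=", 9.8), ("TS2", ">", 55.0)], "cooler_failure")
print(sql)
```

The same triple structure could instead be rendered as a SPARQL FILTER expression or as nested if-else statements, depending on the system model.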
In this phase, the new time-series data generated by the system over a certain number of sliding windows are stored and used to update the originally created RF, by performing the hyperparameter tuning again to find out whether any alteration of the RF parameters could reduce the size of the overall RF while increasing its accuracy. The decision to accept or reject the new updates depends strongly on the accuracy of the newly tuned RF.
5.2. Experimental Results
In this experiment, RF is used, following the steps in the data-driven flowchart in Figure 1, to generate dynamic diagnostic rules to diagnose and monitor the health of a hydraulic test rig. As provided in the dataset, each component's condition ranges between full efficiency, reduced efficiency and close to total failure. In this experiment, for the sake of simplicity, the healthy state is represented by the fully efficient cycles, and the failures by the cycles where a component is close to total failure, while the partial failure state is excluded. Based on this fault description, there are four types of total failure in four different components to monitor: cooler total failure, valve close to total failure, severe internal pump leakage and hydraulic accumulator close to total failure.
Table 2 explains the definition of each fault chosen for this experiment and some example cycles that contain each fault.
The hydraulic system described in this experiment contains 11 sensor readings from three types of sensors located in different components of the hydraulic test rig: 6 pressure sensors, labelled PS1 up to PS6, 4 temperature sensors, TS1-TS4, and finally 1 vibration sensor, labelled VS1. The readings of all 11 sensors from various cycles, covering the five different statuses shown in the table above, are collected in one labelled dataset of 11 features used for RF training and analysis.
As mentioned earlier, RF is selected as the classification method in this work after carefully comparing its results against other well-known classifiers: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbour (KNN), regular decision tree (CART), Naïve Bayes (NB) and Support Vector Machines (SVM). These supervised machine learning methods, along with RF, are used to perform a multiclass classification task to classify the hydraulic test rig faults described in Table 2. Table 3 shows the classification results after performing multiclass classification using the different classifiers. It clearly demonstrates that CART and RF achieve elevated accuracy compared to the rest of the approaches. However, RF is an improvement over CART that overcomes its tendency to overfit the training dataset.
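The comparison described above can be sketched with Scikit-learn's cross_val_score; the synthetic data below stands in for the hydraulic-rig dataset, so the printed scores are illustrative only and will not match Table 3:

```python
# Sketch of the classifier comparison: 10-fold cross-validated accuracy of
# LR, LDA, KNN, CART, NB, SVM and RF on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    "SVM": SVC(),
    "RF": RandomForestClassifier(random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=10).mean()
          for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```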
The non-zero feature importance method is used to discard the features with less impact on the learning process, concentrating only on the features that contribute more to the model's accuracy. The table below shows the importance of each sensor to the RF model, calculated using Equation (1).
Table 4 shows the calculated importance of each of the 11 sensors. There is a variety of ways in which these importance values can be analyzed and evaluated to achieve feature selection. One can pick the single highest-importance feature to represent all the features, or the highest three, the highest six, or just the non-zero ones. However, the most convincing approach is to test all the possibilities and make a logical accuracy-versus-complexity trade-off. For each selected-feature scenario, the RF accuracy and the time complexity, given by $O(T \log n)$, are calculated, where $T$ is the number of trees in the RF and $n$ is the size of the input data used for training, assuming that the number of trees $T$ is constant for all the feature trials. As such, the time complexity is a function of the input data size, represented by the number of features included, and the goal is to reduce it without sacrificing much, or any, of the model accuracy.
In Table 5, four different RF model training experiments were conducted to find the best number of features required to train an RF of 100 decision trees. In the first trial, the most important feature, PS4, is used alone to train the RF model. The second trial used the top three most important features, PS4, PS5 and PS6. The third trial applied the highest six features. Finally, only the non-zero features were selected to train the RF model. In all four experiments, the random forest had fixed, randomly chosen hyperparameters: 100 trees and a maximum depth of 5. Furthermore, a 10-fold cross validation technique is used to compute each trial's accuracy.
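The four trials can be sketched as follows; the data is synthetic, so the subsets merely stand in for {PS4}, {PS4, PS5, PS6}, the top six features and the full set, and the printed accuracies will not reproduce Table 5:

```python
# Sketch of the feature-subset trials: rank features by RF importance, then
# train an RF (100 trees, max depth 5) on the top 1, 3, 6 and all 11
# features, scoring each subset with 10-fold cross validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=11, n_informative=6,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
ranked = np.argsort(rf.fit(X, y).feature_importances_)[::-1]  # best first

accuracies = {}
for k in (1, 3, 6, 11):
    accuracies[k] = cross_val_score(rf, X[:, ranked[:k]], y, cv=10).mean()
    print(f"top {k:2d} features: accuracy {accuracies[k]:.3f}")
```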
For this experiment, the highest six features are used for the training process, since these features provided the best accuracy among all trials, while showing lower time and space complexity than training with all 11 features or with the 8 non-zero features.
Figure 2 shows how the time complexity $O(T \log n)$ and the space complexity $O(n)$ of the RF are directly proportional to the number of features used. It is crucial to emphasize that the amount of accuracy sacrificed, and the added complexity tolerated, depend entirely on the system used and one's preferences; e.g., other researchers might use the highest three features, with 0.977 accuracy, if they are willing to lose more accuracy as the cost of a dramatic drop in both time and space complexity.
The next step is tuning the hyperparameters of the RF applied to the dimensionally reduced dataset of the selected six features: PS4, PS5, PS6, TS2, TS3 and TS4. The hyperparameters subject to tuning in this experiment are the number of trees in the RF, the maximum depth of each tree, the minimum number of samples required to split a node and the minimum number of samples required to form a leaf node. As the purpose behind the RF in this research is to establish a set of base rules for fault diagnosis, the main hyperparameter of focus is the number of trees, to lessen the complexity as much as possible, as well as minimizing each tree's depth where possible.
A random grid of hyperparameters is created by applying a random search over a predefined range for each parameter separately; e.g., the number of trees is predefined to range between 1 and 1000, and only 100 possibilities are selected from that range to form the random grid for this hyperparameter. The RF is then trained using one randomly selected hyperparameter combination at a time, and each selection is validated using three-fold cross validation to calculate the accuracy of the RF model over that particular set of hyperparameters. A set of 100 randomly selected combinations is used to create the grid, which means 300 RF model trainings are executed, considering the three-fold cross validation over the 100 possibilities in the grid. Finally, the set of hyperparameters with the highest three-fold cross validation accuracy is the one selected to generate the diagnostic rules.
The RF model trained after applying feature selection, with the randomly chosen hyperparameters of 100 decision trees and a maximum depth of five, was then tuned: the best hyperparameters found using cross validation over the random search grid improved the accuracy by 0.32%, to reach 0.9865. Additionally, using only 49 trees in total instead of 100, at the same depth, dramatically decreased the complexity and size of the generated RF rules, while increasing the accuracy and speed of the diagnosis.
The best hyperparameters selected from the grid are 49 estimators, a minimum sample split of two, a minimum of one sample per leaf and a maximum depth of 83. It is worth mentioning that using all of these best hyperparameters increased the accuracy to 0.99, but this slight rise is not worth it when compared to the massive increase in the time and size of each tree in the RF caused by the much larger maximum depth.
Figure 3 shows one of the decision trees in the RF after feature selection and hyperparameter tuning.
Each tree in the RF can be translated into a set of nested if-else rules. Moreover, the dynamic rules formed from the RF can be fed into various system models to generate a hybrid approach out of the data-driven and model-based ones. The dynamic rules extracted from the RF can be used as they are, converted into SQL queries if the model is a relational database, or converted into SPARQL queries if the system model is represented by a semantic knowledge base, such as an ontology.
The diagnostic rules can be extracted from the RF dynamically, using a few lines of Python code. For the sake of simplicity, Figure 4 shows the tree in Figure 3 pruned so that only the positive branch of the condition after the root node remains, connected to a series of nested statements showing how this part of the tree is translated into clear dynamic rules.
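One possible sketch of this extraction step uses Scikit-learn's export_text on a single estimator of the trained forest; the data below is synthetic, and the sensor names are reused only to label the stand-in features:

```python
# Sketch: extracting readable if-else rules from one tree of a trained RF.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

feature_names = ["PS4", "PS5", "PS6", "TS2", "TS3", "TS4"]
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# 49 trees with maximum depth 5, matching the tuned configuration in the text.
rf = RandomForestClassifier(n_estimators=49, max_depth=5, random_state=0)
rf.fit(X, y)

# Each estimator is an ordinary decision tree; render its split rules as
# an indented if-else style listing.
rules = export_text(rf.estimators_[0], feature_names=feature_names)
print(rules)
```

The printed listing is exactly the nested-condition form that can then be rewritten as Python if-else statements, SQL or SPARQL.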
The work in [7] provided a graph-based FDD system for industrial systems using a model-based approach, built on a knowledge base of the system under diagnosis, such as an ontology, into which a set of static diagnostic rules created by the system expert is manually fed, forming a causation relationship between the system sensors and the faults they lead to. In this work, we propose creating dynamic rules using RF, extracting these rules and feeding them into the ontology in place of the expert rules, which are static, unreliable and unverifiable. Furthermore, the diagnostic rules extracted using RF can be applied in a variety of forms to fit the model expressing the system.
Figure 5 showcases the rules extracted from the optimized, hyperparameter-tuned RF, and how these rules are transformed into various forms and types to match the nature of the system model. As mentioned earlier in this chapter, the diagnostic rules can be translated into SQL queries when a relational database is the system model, or SPARQL queries if a semantic knowledge base, such as an ontology, represents the system. The rules extracted from each tree may be scheduled separately, or all together with the trees forming the RF.