Towards Developing an Automated Faults Characterisation Framework for Rotating Machines. Part 1: Rotor-Related Faults

Abstract: Rotating machines are pivotal to the achievement of core operational objectives within various industries. Recent drives for developing smart systems, coupled with significant advancements in computational technologies, have immensely increased the complexity of this group of critical physical industrial assets (PIAs). Vibration-based techniques have contributed significantly towards understanding the failure modes of rotating machines and their associated components. However, the very large data requirements attributable to routine vibration-based fault diagnosis at multiple measurement locations have led to the quest for alternative approaches that can reduce fault diagnosis downtime. Initiatives aimed at rationalising vibration-based condition monitoring data, so as to retain only the information that offers maximum variability, include the combination of coherent composite spectrum (CCS) and principal components analysis (PCA) for rotor-related faults diagnosis. While there is no doubt about the potential of this approach, especially given that it is independent of the number of measurement locations and foundation types, its over-reliance on manual classification makes it prone to human subjectivity and lack of repeatability. The current study therefore aims to further enhance existing CCS capability in two facets: (1) exploring the possibility of automating the process by testing its compatibility with various machine learning techniques, and (2) incorporating spectrum energy as a novel feature. It was observed that artificial neural networks (ANN) offered the most accurate and consistent classification outcomes under all considered scenarios, which demonstrates immense opportunity for automating the process. The paper describes the computational approaches, signal processing parameters and experiments used for generating the analysed vibration data.


Introduction
The past few decades have been characterised by very significant population growth across the world (i.e., from 2.53 billion in 1950 to 7.16 billion in 2011, and predicted to reach 14.4 billion within the next six decades) [1]. As a direct consequence of this unprecedented growth, global primary energy consumption has correspondingly risen from 3701 Mtoe in the mid-1960s to approximately 13,511 Mtoe in 2017 [1,2]. Besides the strains exerted by current energy demands on existing energy generation and distribution systems across the world, the sustainability of present scenarios is questionable, owing to the rapid depletion of primary energy resources such as crude oil, coal, and natural gas (estimated to be completely depleted in 50.2, 52.6 and 134 years, respectively) [2]. Additionally, the global upsurge

Coherent Composite Spectrum (CCS)
Full details of the computational approach of CCS have been described in several earlier studies [7,18,26-29]. However, in order for the present study to be independently comprehensible without necessarily consulting other articles, an abridged form of the CCS computation [26-28] is again provided here. If the number of bearings on a typical industrial rotating machine is $b$, each of which has a fitted vibration sensor, the measured time-series signals can be divided into $n_s$ equal-length segments. The power spectral density (PSD) of the signal $x_p$ from the $p$th bearing at the frequency $f_k$ can be calculated as [26-28]:

$$S_{x_p x_p}(f_k) = \frac{1}{n_s}\sum_{r=1}^{n_s} X_p^{r}(f_k)\, X_p^{r*}(f_k) \tag{1}$$

where $X_p^{r}(f_k)$ is the discrete Fourier transform (DFT) of the $r$th segment of the signal $x_p$, and $X_p^{r*}(f_k)$ is its complex conjugate, for $p = 1, 2, \cdots, b$. Similarly, the cross-power spectral density (CSD) for the signals $x_p$ and $x_{p+1}$ can be calculated as:

$$S_{x_p x_{p+1}}(f_k) = \frac{1}{n_s}\sum_{r=1}^{n_s} X_p^{r}(f_k)\, X_{p+1}^{r*}(f_k) \tag{2}$$

The coherence of the signals $x_p$ and $x_{p+1}$, used for background noise suppression, can be calculated as:

$$\gamma^{2}_{p(p+1)}(f_k) = \frac{\left|S_{x_p x_{p+1}}(f_k)\right|^{2}}{S_{x_p x_p}(f_k)\, S_{x_{p+1} x_{p+1}}(f_k)} \tag{3}$$

The coherent CSD of the signals from the $p$th and $(p+1)$th bearings, $S^{r}_{x_p \gamma^{2}_{p(p+1)} x_{p+1}}(f_k)$, can then be defined (Equation (4)). Each of the $r$th segments from each signal can therefore be fused into a single component, $X^{r}_{CCS}(f_k)$ (Equation (5)), and the CCS for the entire machine, $S_{CCS}(f_k)$, can then be calculated (Equation (6)). $S_{CCS}(f_k)$ is a sequence of complex numbers that allows for the computation of the single-sided amplitude spectrum of the CCS, $A_{CCS}(f_k)$ (Equation (7)).
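As a minimal sketch of the PSD, CSD and coherence steps only (Equations (1)-(3); the subsequent fusion steps are not reproduced), the segment-averaged quantities for two bearing signals might be computed with NumPy as shown here. The function names, the non-overlapping segmentation, the placement of the complex conjugate on the second signal in the CSD, and the synthetic test signals are all assumptions made for illustration and are not taken from the study.

```python
import numpy as np

def segment_dfts(x, seg_len):
    """Split x into equal-length segments, apply a Hanning window, return each DFT (rows = segments)."""
    n_s = len(x) // seg_len                              # number of complete segments
    segments = np.reshape(x[: n_s * seg_len], (n_s, seg_len))
    return np.fft.rfft(segments * np.hanning(seg_len), axis=1)

def psd(X_p):
    """Segment-averaged power spectral density, as in Equation (1)."""
    return np.mean(X_p * np.conj(X_p), axis=0).real

def csd(X_p, X_q):
    """Segment-averaged cross-power spectral density (conjugate on X_q is an assumed convention)."""
    return np.mean(X_p * np.conj(X_q), axis=0)

def coherence(X_p, X_q):
    """Magnitude-squared coherence between two signals, bounded by 0 and 1."""
    return np.abs(csd(X_p, X_q)) ** 2 / (psd(X_p) * psd(X_q))

# Synthetic stand-ins for two bearing signals: a shared 50 Hz tone plus independent noise.
rng = np.random.default_rng(0)
fs = 1000.0                                              # hypothetical sampling rate, Hz
t = np.arange(8000) / fs
tone = np.sin(2 * np.pi * 50.0 * t)
x1 = tone + 0.5 * rng.standard_normal(t.size)
x2 = 0.8 * tone + 0.5 * rng.standard_normal(t.size)

X1 = segment_dfts(x1, seg_len=1000)                      # 8 segments, 1 Hz resolution
X2 = segment_dfts(x2, seg_len=1000)
gamma2 = coherence(X1, X2)
freqs = np.fft.rfftfreq(1000, d=1 / fs)
```

In this toy case the coherence is close to unity at the shared 50 Hz component and noticeably lower at frequencies containing only uncorrelated noise, which is the property that the CCS relies on for background noise suppression.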

Feature Extraction
Earlier CCS-based diagnoses have only applied the maximum amplitude values of each harmonic during classification. In the current study, however, we explore the use of an entirely new feature: spectrum energy (SE). Just as root mean square (rms) gives a representation of the energy content of the time waveform, SE provides information about the energy content of the measured vibration signal in the frequency domain. Considering that the CCS represents several time waveforms, the relevance of SE becomes evident [30]. SE is particularly useful because it is a universal indicator capable of showing changing trends in vibration data, due either to the dynamic characteristics of the machine of interest or to the emergence of incipient faults. For a typical $A_{CCS}(f_k)$ computed as per Equation (7) at a frequency $f_k$, where $f_k = (k - 1)\,df$, $k = 1, 2, \cdots, N/2$, $N$ is the number of data points and $df$ is the frequency resolution, the SE between the selected harmonics at intervals of $df$ can be defined as: 10,15,20) (8)

Dimensionality Reduction
The complexity of industrial rotating machines makes typical vibration data non-linear and high-dimensional. This dimensionality often makes diagnosis difficult, owing to the interference of redundant data points with the core information that actually indicates variability. In this study, PCA was used to reduce data dimensionality while still retaining the data points that aid the differentiation of various machine states [31-33]. This is particularly useful because it helps rationalise the amount of data that needs to be analysed, thereby reducing the time required to implement the necessary repair/replace decisions. The implementation of PCA in this study was performed in two stages: the first involved centralising (or standardising) the datasets, while the second involved performing singular value decomposition (SVD) on the resulting data.

Data Centralisation and Standardisation
For a real matrix $A \in \mathbb{R}^{m \times n}$, where $m$ is the number of samples and $n$ is the number of features (dimensions), PCA requires the centralisation of each column. Here $a_{ij}$ denotes a typical element of $A$, while $x_{ij}$ is the corresponding element of the centralised matrix $X$.
To centralise the matrix $A$, the element $x_{ij}$ of the centralised matrix $X$ is defined as:

$$x_{ij} = a_{ij} - \bar{A}_j \tag{9}$$

where $\bar{A}_j$ is the sample mean of the elements of the $j$th column of matrix $A$, which can be computed as:

$$\bar{A}_j = \frac{1}{m}\sum_{i=1}^{m} a_{ij} \tag{10}$$

It is vital to note that the centralised data has a column-wise zero mean, which is essential for the subsequent computations.
To standardise the matrix $A$, the element $x_{ij}$ of the standardised matrix $X$ is defined as:

$$x_{ij} = \frac{a_{ij} - \bar{A}_j}{S_j} \tag{11}$$

where $S_j$ is the sample standard deviation of the $j$th column of $A$, which is mathematically represented as:

$$S_j = \sqrt{\frac{1}{m-1}\sum_{i=1}^{m}\left(a_{ij} - \bar{A}_j\right)^{2}} \tag{12}$$

The standardised data has a column-wise zero mean and unit standard deviation, while still retaining the shape properties of the original data.
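The two pre-processing options described above can be sketched in a few lines of NumPy. This is an illustrative example only: the function names and the small matrix `A` are hypothetical and do not come from the study's data.

```python
import numpy as np

def centralise(A):
    """Subtract each column's sample mean, giving column-wise zero mean."""
    return A - A.mean(axis=0)

def standardise(A):
    """Centralise, then divide by the sample standard deviation (m - 1 in the denominator)."""
    return (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

# Hypothetical 3-sample, 2-feature matrix with very different column scales.
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

X_c = centralise(A)     # columns keep their original spread
X_s = standardise(A)    # columns are rescaled to unit standard deviation
```

Centralisation preserves the relative scale of the features, so a feature with large amplitudes dominates the subsequent PCA, whereas standardisation gives every feature equal weight. This is one plausible reason for treating the two as distinct input types.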

Singular Value Decomposition (SVD)
SVD was used here to perform PCA on the centralised/standardised data, where the SVD of matrix $X$ is defined as:

$$X = U \Sigma V^{T} \tag{13}$$

where the matrix $\Sigma \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix of non-negative numbers $\sigma_i$, called the singular values of $X$. The columns of the matrix $U \in \mathbb{R}^{m \times m}$ are orthonormal vectors, referred to as the left singular vectors of $X$. Similarly, the columns of the matrix $V \in \mathbb{R}^{n \times n}$ are orthonormal vectors, referred to as the right singular vectors of $X$.
The covariance matrix $C \in \mathbb{R}^{n \times n}$ can thus be computed as:

$$C = \frac{X^{T}X}{m-1} = \frac{V \Sigma^{T} U^{T} U \Sigma V^{T}}{m-1} = \frac{V \hat{\Sigma}^{2} V^{T}}{m-1} \tag{14}$$

where $U^{T}U = E$ (the identity matrix), and $\hat{\Sigma}^{2}$ is defined as:

$$\hat{\Sigma}^{2} = \Sigma^{T}\Sigma \tag{15}$$

By comparing this to the eigendecomposition of the covariance matrix $C$, it can be seen that the right singular vectors $V$ of $X$ are in fact equivalent to the eigenvectors of $C$. A relationship between the eigenvalues $\lambda_i$ of $C$ and the singular values $\sigma_i$ of $X$ can also be derived thus:

$$\lambda_i = \frac{\sigma_i^{2}}{m-1} \tag{16}$$

The SVD therefore enables the calculation of the score matrix (result) $T \in \mathbb{R}^{m \times n}$ for a PCA, which can be mathematically represented as:

$$T = XV = U\Sigma \tag{17}$$

By considering only the $L$ largest singular values and their corresponding singular vectors, the truncated score matrix $T_L \in \mathbb{R}^{m \times L}$ can be defined as:

$$T_L = U_L \Sigma_L = X V_L \tag{18}$$

where $U_L \in \mathbb{R}^{m \times L}$, $\Sigma_L \in \mathbb{R}^{L \times L}$, and $V_L \in \mathbb{R}^{n \times L}$.
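The following NumPy sketch illustrates the SVD route to PCA described above: the truncated scores, the retained right singular vectors, and the fraction of variance explained by each component. The function name `pca_svd`, the use of the economy-size SVD (`full_matrices=False`) and the random test matrix are assumptions for illustration. The sample covariance is assumed to use an $m - 1$ denominator.

```python
import numpy as np

def pca_svd(X, L):
    """PCA of an already centralised/standardised matrix X via SVD, keeping L components."""
    m = X.shape[0]
    # Economy-size SVD: U is m x r, s holds the r singular values, Vt is r x n.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T_L = U[:, :L] * s[:L]                   # truncated scores, U_L @ diag(sigma_L)
    V_L = Vt[:L].T                           # first L right singular vectors (n x L)
    eigenvalues = s ** 2 / (m - 1)           # eigenvalues of the sample covariance matrix
    explained = eigenvalues / eigenvalues.sum()
    return T_L, V_L, explained

# Hypothetical data: 50 samples of 5 correlated features, centralised column-wise.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 5))
X = A - A.mean(axis=0)

T_2, V_2, explained = pca_svd(X, L=2)
```

The entries of `explained` correspond to the percentage of explained variance per principal component (as a fraction), so summing its first two entries indicates how much information is retained when only PCs 1 and 2 are kept.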

Supervised Learning
Supervised learning is a specific machine learning category, whereby an algorithm can either learn a pattern or build a model (function) using labelled training data, and subsequently infer new instances based on such earlier learned patterns or models [33-39]. Solving a specific supervised learning problem requires several steps, including data type determination, training dataset collection, input features determination, learning algorithm determination, adjustment of the learning algorithm parameters and learning accuracy evaluation. Considering that this study represents the first application of machine learning to CCS-based faults diagnosis, the authors found it useful to explore a wide range of machine learning algorithms [40,41], so as to compare their performance under each of the considered scenarios and then select the most appropriate with regard to ease of deployment and accuracy. Based on this premise, the five supervised learning algorithms considered for this study are k-Nearest Neighbours (k-NN), the Naïve Bayes classifier, linear support vector machine (SVM), Gaussian SVM and artificial neural networks (ANN), each of which was assessed using K-fold cross-validation. The justification for selecting these particular learning techniques is mainly their reasonably straightforward computational approach and verifiable success with non-linear datasets, including vibration data.

k-Nearest Neighbours (k-NN)
k-NN is an instance-based learning algorithm that assumes that instances in a dataset lie in the vicinity, in feature space, of other instances with similar properties [40]. The classification of an object is determined by a "majority vote" of its neighbours, and each object is assigned to the most common class among its k nearest neighbours, where k is a (typically small) positive integer. In the case k = 1, the class of the object is solely determined by its nearest neighbour.
Consider a set of labelled training data, $T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$, where $x_i$ is the feature vector of each instance and $y_i \in \{c_1, c_2, \cdots, c_K\}$, $i = 1, 2, \cdots, N$, is the corresponding label. For a test sample $(x, y)$, the $k$ instances nearest to $x$ are identified by the k-NN algorithm based on a given distance measure. The set of $k$ nearest neighbours of $x$ is written as $N_k(x)$. The label of the test sample $x$ can thus be decided by the following decision function:

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad i = 1, 2, \cdots, N; \; j = 1, 2, \cdots, K \tag{19}$$

where $I$ is the indicator function.
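The majority-vote decision rule above can be sketched as follows. This is a minimal illustration that assumes a Euclidean distance measure. The function name, the six two-dimensional training points and their labels are hypothetical toy values rather than data from the study.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Label x by majority vote among its k nearest training instances (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]                  # indices of the k nearest neighbours
    labels, votes = np.unique(y_train[nearest], return_counts=True)
    # np.argmax returns the first maximum, so ties go to the alphabetically first label.
    return labels[np.argmax(votes)]

# Toy two-feature training set (e.g., two principal component scores per sample).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
y_train = np.array(["Healthy", "Healthy", "Healthy", "Bow", "Bow", "Bow"])

label_near_origin = knn_predict(X_train, y_train, np.array([0.15, 0.10]), k=3)
label_far = knn_predict(X_train, y_train, np.array([3.0, 3.0]), k=1)
```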

Naïve Bayes Classifier
The Naïve Bayes classifier is a probabilistic classifier based on Bayes' Theorem, under the assumption of conditional independence between features [40,42]. For the same training set $T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$ with labels $y_i \in \{c_1, c_2, \cdots, c_K\}$, $i = 1, 2, \cdots, N$, the number of possible values for the $l$th feature $x^{(l)}$ ($l = 1, 2, \cdots, n$) is given by $S_l$ and the number of possible values for $Y$ is given by $K$. Firstly, under the assumption of conditional independence, the joint probability distribution $P(X, Y)$ of the input and output can be learned by the Naïve Bayes classifier, using the conditional probability distribution displayed in Equation (20):

$$P(X = x \mid Y = c_k) = \prod_{l=1}^{n} P\left(X^{(l)} = x^{(l)} \mid Y = c_k\right), \quad k = 1, 2, \cdots, K \tag{20}$$

Secondly, based on the learned model and by applying Bayes' Theorem, the output label $y$ with the maximum posterior probability can be computed for any given input $x$ as:

$$y = \arg\max_{c_k} P(Y = c_k) \prod_{l=1}^{n} P\left(X^{(l)} = x^{(l)} \mid Y = c_k\right) \tag{21}$$
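A minimal sketch of the discrete formulation above is given here, with probabilities estimated as simple relative frequencies. The function names, the absence of any smoothing, and the tiny integer-coded dataset are assumptions for illustration. For continuous features such as principal component scores, a Gaussian variant would be the more usual choice, which is not shown.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate class priors and per-feature conditional probabilities by relative frequency."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    cond = {}
    for c in classes:
        X_c = X[y == c]
        for l in range(X.shape[1]):
            for v in np.unique(X[:, l]):
                cond[(c, l, v)] = np.mean(X_c[:, l] == v)   # P(x_l = v | Y = c)
    return classes, priors, cond

def predict_naive_bayes(model, x):
    """Return the class maximising prior times the product of conditional probabilities."""
    classes, priors, cond = model
    scores = [priors[c] * np.prod([cond.get((c, l, v), 0.0) for l, v in enumerate(x)])
              for c in classes]
    return classes[int(np.argmax(scores))]

# Hypothetical data: two binary features, two classes (0 and 1).
X = np.array([[0, 0], [0, 1], [0, 0], [1, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

model = fit_naive_bayes(X, y)
pred_a = predict_naive_bayes(model, np.array([0, 0]))
pred_b = predict_naive_bayes(model, np.array([1, 1]))
```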

Support Vector Machine (SVM)
SVM constructs a hyperplane or a set of hyperplanes, in high or infinite-dimensional space, which can be used for classification, regression or other tasks [34,39]. The generalisation error of this classifier is best reduced when the classification boundary is as far as possible from the nearest training data points. The SVM can be viewed as a constrained quadratic optimisation problem, formulated according to the principle of structural risk minimisation [42]. The SVM constructs an optimal separation hyperplane $f(x) = 0$ between datasets, given by:

$$f(x) = W^{T}x + b = 0 \tag{22}$$

where $W$ is an N-dimensional vector and $b$ is a scalar.

Artificial Neural Network (ANN)
ANNs are mathematical or computational models that imitate the structure and function of a biological neural network such as an animal's central nervous system, particularly the brain. The networks can be used to estimate or approximate any non-linear function. These networks have the ability to "learn" from and summarise data through experimental application to known data, which implies that ANNs can reasonably act as automatic recognition systems for solving practical problems of varying learning complexity under different scenarios. The simplest and most widespread form of an ANN consists of an input layer, a hidden layer, and an output layer. Generally speaking, an ANN is composed of multiple layers of "neurons". Each layer of neurons has an input and an output, where the input is the output from the previous layer. Each layer, Layer(i), is composed of $N_i$ neurons (where $N_i$ signifies the number of neurons on the $i$th layer) [34,36,40]. Each neuron takes the outputs of the neurons on Layer(i − 1) as its input. The connections between neurons are called synapses and each synapse is assigned a weight, which determines the contribution of the previous neuron to the subsequent ones [34,36,40].
The output $y$ of a neuron can thus be written as:

$$y = f\left(W^{T}x + b\right) \tag{23}$$

where $f$ is the activation function, $W$ are the weights and $b$ is the scalar bias term. The weights $W$ are assigned through an iterative training process. The transfer function adopted here is the symmetric sigmoid transfer function. Since the ANN is of the back-propagation type, scaled conjugate gradient (SCG) was used as the learning algorithm as well as for overfitting avoidance.
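To make the layer-by-layer computation concrete, the sketch below evaluates the forward pass of a network with one hidden layer, applying the neuron output expression to each layer in turn. It assumes that the symmetric sigmoid takes the hyperbolic tangent form. The layer sizes (2 inputs, 10 hidden neurons, 5 outputs), the random weights and the function name are hypothetical, and training by SCG is not shown.

```python
import numpy as np

def forward(x, W_hidden, b_hidden, W_out, b_out):
    """Forward pass: each layer computes f(W x + b) with f = tanh (assumed symmetric sigmoid)."""
    h = np.tanh(W_hidden @ x + b_hidden)     # hidden-layer outputs feed the next layer
    return np.tanh(W_out @ h + b_out)        # one output per class

# Hypothetical, untrained weights purely to exercise the function.
rng = np.random.default_rng(2)
n_in, n_hidden, n_out = 2, 10, 5             # e.g., 2 PC scores in, 5 machine states out
W_hidden = rng.standard_normal((n_hidden, n_in))
b_hidden = rng.standard_normal(n_hidden)
W_out = rng.standard_normal((n_out, n_hidden))
b_out = rng.standard_normal(n_out)

x = np.array([0.3, -1.2])
y = forward(x, W_hidden, b_hidden, W_out, b_out)
predicted_class = int(np.argmax(y))          # index of the largest output
```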

K-Fold Cross-Validation
Cross-validation is a model validation technique that enables a machine learning algorithm's prediction accuracy to be evaluated in practice [41]. In order to perform K-fold cross-validation, the original dataset is split randomly into K subsets of roughly equal size [41]. Of these K subsets, a single subset is retained to be used as validation data, while the other K − 1 subsets are used as training data. Cross-validation is then repeated K times, with each subset being used exactly once for validation. These K results are then averaged to obtain a single estimation. The most common value used is K = 10.
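The procedure described above can be sketched as follows. The helper names, the fixed random seed and the majority-class baseline used to exercise the loop are assumptions for illustration. Any of the classifiers discussed earlier could be passed in place of the baseline.

```python
import numpy as np

def k_fold_indices(n_samples, K=10, seed=0):
    """Yield (train, validation) index arrays for K roughly equal random folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), K)
    for i in range(K):
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        yield train, folds[i]                # each fold is used exactly once for validation

def cross_validate(fit_predict, X, y, K=10, seed=0):
    """Average validation accuracy of fit_predict(X_train, y_train, X_val) over K folds."""
    accuracies = []
    for train, val in k_fold_indices(len(y), K, seed):
        y_hat = fit_predict(X[train], y[train], X[val])
        accuracies.append(np.mean(y_hat == y[val]))
    return float(np.mean(accuracies))

def majority_baseline(X_train, y_train, X_val):
    """Hypothetical stand-in classifier: always predict the most common training label."""
    values, counts = np.unique(y_train, return_counts=True)
    return np.full(len(X_val), values[np.argmax(counts)])

# Toy dataset: 103 samples, of which 70 belong to class 0 and 33 to class 1.
X = np.zeros((103, 2))
y = np.array([0] * 70 + [1] * 33)
accuracy = cross_validate(majority_baseline, X, y, K=10)
```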

Experimental Organisation and Data Source
In order to foster substantial acceptance of any newly proposed technique, an experimental validation is often required, and this is commonly achieved with the aid of an experimental test rig. A representative test rig may be described as one which has been adequately set up to correctly simulate the faults investigated in the research. Typical industrial rotating machines are complex, multi-component structures (e.g., rotors, bearings, gears, couplings, electric motors, blades, etc.) that often possess multiple failure modes [43]. As it would be unrealistic to examine all fault classes at a single instance, this study only considers some of the most common rotor-related (i.e., low frequency) faults. The remainder of the section presents further details of the various components that comprise the test rig, as well as the experimental simulation of the studied rotating machine conditions.

The Rig
The experimental rig is primarily made up of two mild steel shafts, which are connected by a rigid coupling. The longer shaft has a length of 1000 mm and a diameter of 20 mm, while the shorter has a length of 500 mm and a diameter of 20 mm. The longer shaft end of the entire rotor assembly was then coupled to a 0.75 kW electric motor via a flexible coupling. In addition to connecting the rig components, the flexible coupling on the motor end of the rig assembly also serves the function of preventing the transmission of motor-related fault signals. In order to ensure rig balance and acceptable rotor deflections, three similarly machined bright mild steel discs with dimensions of 125 mm × 20 mm × 15 mm (i.e., outer diameter × inner diameter × thickness) were mounted at equal distances on each rotor. The two balance discs mounted on the 1000 mm shaft were located at 300 mm and 190 mm from the flexible coupling and bearing 2, respectively. The third balance disc was placed at mid-span of the 500 mm shaft (i.e., 210 mm from bearings 3 and 4, respectively). The shaft assembly was then supported by 4 SKF-type flange-mounted antifriction ball bearings. Each bearing was secured to its pedestal by four 6 mm-thick bright mild steel threaded bars. Vibration data from each bearing were collected via a 100 mV/g accelerometer mounted at 45 degrees. The rationale behind the diagonally mounted accelerometer is sensor reduction, as well as the expectation that both vertical and horizontal responses can be reasonably represented. Figure 2 shows the experimental rig assembly and its core components.

Dynamic Characteristics
Experimental modal analysis is a widely acknowledged design testing and qualification tool across various engineering disciplines [44]. Adequate knowledge of the modal properties of a structure paves the way for design upgrades, faults diagnosis and remaining useful life enhancement [45,46]. Based on this premise, the impact-response method of experimental modal analysis was also used here to establish an understanding of the dynamic behaviour of the rig, as well as to validate the origins of subsequent fault diagnostic features (especially spectral peaks) [47-50]. During the experimental modal testing, the complete rig assembly was excited by an ICP-PCB type instrumented hammer and the corresponding vibration responses were measured using the accelerometers. The first few natural frequencies (by appearance) of the experimental rig assembly were determined to be 47 Hz, 55.54 Hz, 57.98 Hz and 127 Hz. The use of threaded bars to connect the bearings to their pedestals provided comparable flexibilities in both vertical and horizontal directions. Hence, the natural frequencies in both directions were very similar.

Simulation of Faults
A total of five commonly encountered rotor-related cases (i.e., a relatively healthy baseline and four fault cases, as shown in Figure 3) were experimentally simulated, so as to cover a reasonably wide range of practical operating conditions of typical industrial rotating machines. It was impossible to achieve a perfect alignment while setting up the baseline case, as is often the situation in real-life scenarios. A summarised description of the mode of experimental simulation for each case, the location of the fault on the rig assembly and the severities is provided in Table 1. Considering that several modern-day industrial rotating machines operate at multiple speeds, vibration data were collected at three distinct speeds for each case: 1200 rpm (20 Hz), 1800 rpm (30 Hz) and 2400 rpm (40 Hz). This approach provides the opportunity to extensively understand the dynamics of the studied class of machines under different operating conditions. Under each experimental scenario (i.e., under each case and each machine speed), vibration data were obtained at a sampling rate of 10 kHz over a timespan of approximately 120 seconds.

Data Arrangement and Signal Processing Parameters
In this study, vibration data were collected under a total of 15 experimental scenarios. A scenario here represents a specific machine condition (e.g., rotor misalignment) at a specific machine speed (e.g., 40 Hz). Prior to applying any of the tools within the proposed hybrid framework, it was crucial to ensure that the measured vibration data under individual scenarios exhibit comparable signal processing characteristics such as number of data points, frequency resolution, sampling frequency, etc.

Data Arrangement for CCS Data Fusion
Data preparation for CCS commences with averaging. In the present work, an "average" refers to a single complete CCS calculation, which is then converted into a sample with a number of features for training, validation, or classification in the subsequent machine learning stage. A two-stage overlap method was used for generating enough averages from the raw data (the higher the number of averages, the higher the similarity between the reconstructed and original signals). As shown in Figure 4, the initial stage of data preparation involves splitting the raw data into segments of 20,000 data points with an overlap of 80%. The next stage is to calculate the $i$th CCS average from the $[2(i-1)+1]$th to $[2(i-1)+10]$th segments from each bearing pedestal using a Hanning window. The sampling rate used is $F_S = 10$ kHz and the frequency resolution is $df = 0.5$ Hz. The number of raw data points, as well as the number of averages generated for each case, is listed in Table 2.
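The arithmetic behind the two-stage overlap scheme can be checked with a short sketch. It assumes a recording of exactly 120 s at 10 kHz (a round figure chosen for illustration, since the recorded timespans were only approximately 120 s), so the counts it produces need not match those in Table 2. The function names are hypothetical.

```python
def overlap_segment_count(n_points, seg_len=20_000, overlap=0.8):
    """Number of complete segments when consecutive segments overlap by the given fraction."""
    hop = seg_len - round(seg_len * overlap)         # 4000 new points per segment
    return (n_points - seg_len) // hop + 1 if n_points >= seg_len else 0

def ccs_average_count(n_segments, per_average=10, step=2):
    """Number of averages when average i spans segments 2(i-1)+1 to 2(i-1)+10."""
    return (n_segments - per_average) // step + 1 if n_segments >= per_average else 0

fs = 10_000                                          # sampling rate, Hz
seg_len = 20_000                                     # data points per segment
df = fs / seg_len                                    # frequency resolution, Hz

n_points = 120 * fs                                  # assumed exact 120 s record
n_segments = overlap_segment_count(n_points, seg_len)
n_averages = ccs_average_count(n_segments)
```

Under these assumptions, a single record yields 296 overlapping segments and 144 CCS averages, each at a frequency resolution of 0.5 Hz, which illustrates how the overlap scheme produces a usable number of samples from a limited measurement.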
After the data preparation, Equations (1)-(7) were then used to generate the typical CCS shown in Figure 5. Each spectrum represents a specific case (i.e., Healthy, Bow, Loose, Mlign and Rub) at 2400 rpm (40 Hz). It is useful to reiterate that each CCS is a fusion of the vibration data from all 4 bearings, thereby providing a complete picture of the dynamics of the entire machine. During the faults diagnosis, only the first five harmonic components were considered, as it was adjudged that these would contain sufficient information to distinguish faults at these frequencies. Based on the types and magnitudes of the harmonics present alone, the Bow case exhibits much higher amplitudes at all harmonics than the other cases. In the Rub case, however, the pattern of harmonic amplitudes can be described as the reverse of the Bow case: the higher harmonics (especially 3X, 4X and 5X) had the highest amplitudes, as opposed to Bow, which had a far superior 1X amplitude. The Loose and Bow cases also displayed similar 1X and 4X harmonic patterns. Unlike the other cases, in which harmonic patterns sometimes appeared similar while amplitudes differed (and vice versa), the Healthy and Mlign cases were immensely similar and almost indistinguishable on all counts. This phenomenon was anticipated because of the low severity of misalignment in the Mlign case, coupled with the inherent residual misalignment in the Healthy case. In order to enhance the ability to distinguish faults at all speeds, the SE feature was applied here based on Equation (8). The computed SEs for the harmonics were then used as inputs to the PCA stage.

Data Arrangement for Dimensionality Reduction
The SE of the CCS allows for the extraction of two features: firstly, the SE of 1X, 2X, 3X, 4X and 5X; and secondly, the normalised harmonics, which are essentially their ratios (e.g., 2X/1X, 3X/1X, 4X/1X and 5X/1X). By centralising and standardising each of these features according to Equations (9)-(12), 4 input data types (i.e., centralised SE, centralised ratio, standardised SE and standardised ratio) were generated for the PCA. The PCA converts these original features into the same number of principal components (PCs). The percentage of explained variance of each component indicates how much information the resultant PC holds. Typically, the first few PCs hold the vast majority of the information. Based on this premise, the remaining PCs can be discarded to reduce the dimensionality of the dataset without necessarily compromising diagnosis quality. Table 3 provides a list of the percentages of explained variance for each PC for the different features at all speeds. It is apparent that the combination of PCs 1 and 2 accounts for more than three-quarters of the explained variance for each feature at all speeds.

Data Arrangement for Dimensionality Reduction
The SE of the CCS allows for the extraction of two features. Firstly the SE of 1X, 2X, 3X, 4X and 5X; and secondly the normalised harmonics is basically their ratios (e.g., 2X/1X, 3X/1X, 4X/1X and 5X/1X). By centralising and standardising each of these features according to Equations (9)-(12), 4 input data types (i.e., centralised SE, centralised ratio, standardised SE and standardised ratio), were generated for the PCA. The PCA converts these original features into the same number of principal components (PCs). The percentage of explained variance of each component indicates how much information the resultant PC holds. Typically, the first few PCs hold the vast majority of information. Based on this premise, the remaining PCs can, therefore, be discarded to reduce dimensionality of the dataset without necessarily compromising diagnosis quality. Table 3 provides a list of the percentages of explained variance for each PC for the different features at all speeds. Additionally, it is apparent that the combination of PCs 1 and 2 account for more than 3 4 of the explained variance for each feature at all speeds.  Figure 6 compares the PCA results of all four features at 40Hz so as to examine the stability of SE with PCA alone, so as to justify the need for more advanced machine learning approaches. Of all the speeds considered, only 40 Hz offered reasonable separation for all cases with SE as a feature. This was perhaps due to the enhanced amplitude of vibration at this speed due to its closeness to the first natural frequency. Despite the reasonably good performance of SE at 40 Hz, the use of PCA alone is limited because it is unable to effectively integrate all speeds into a single map. This is due to disparity in amplitude of vibration at different speeds. Hence multiple charts would be required for each speed.  

Data Analysis and Discussion of Results
The results are first examined in terms of the classification accuracy of the various machine learning techniques for all features at all machine speeds. This comparison indicates the best technique-feature combinations under different scenarios of speed and machine condition. The remainder of the analysis assesses the visualisation strength of the techniques.

Accuracy Comparison
As previously highlighted, although earlier combinations of CCS and PCA for rotor-related faults classification yielded encouraging outcomes, all such efforts were wholly based on manual human observation of perceived patterns in new unlabelled samples. The dominance of human intervention was adjudged a potential weak link, especially from the viewpoint of repeatability, which could eventually jeopardise the quality of diagnosis when dealing with critical industrial rotating machines. Therefore, implementing an approach that is capable of learning historical patterns and then using such knowledge to perform future classifications would greatly enhance reliability. The machine learning classifiers were trained and validated using PCs 1 and 2, together with their labels, as input data and K-fold cross-validation (K = 10). Table 4 compares the resulting accuracies for all features. The results show that feature type has a significant effect on classification accuracy. The features which demonstrated the best performance at the PCA stage (i.e., centralised ratio at 20 Hz, standardised ratio at 30 Hz, and standardised SE at 40 Hz) were also the best features for automatic fault classification for all the tested machine learning classifiers.
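The 10-fold cross-validation procedure can be sketched as follows. This is an illustrative example only: it uses a simple K-NN classifier written in NumPy, the fold assignment is a plain shuffled split (whether the folds in the study were stratified is not stated), and the PC1-PC2 scores are synthetic clusters for five hypothetical classes rather than experimental data.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test point by majority vote among its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])


def kfold_accuracy(X, y, n_folds=10, seed=0):
    """Mean accuracy over n_folds held-out folds (shuffled, non-stratified)."""
    order = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(order, n_folds)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = knn_predict(X[train_idx], y[train_idx], X[test_idx])
        accuracies.append(np.mean(pred == y[test_idx]))
    return float(np.mean(accuracies))


# Synthetic PC1-PC2 scores for five hypothetical classes; NOT data from the study.
rng = np.random.default_rng(1)
centres = np.array([[0, 0], [6, 0], [0, 6], [6, 6], [3, 12]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centres])
y = np.repeat(np.arange(5), 20)

print(f"10-fold accuracy: {kfold_accuracy(X, y):.3f}")
```

Each sample is used for testing exactly once, so the averaged accuracy reflects performance on data the classifier did not see during training.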

Visualised Decision Rules
To further enhance understanding of the classification mechanisms, the 2D decision rules presented in Figures 7-9 were developed. The trained classifiers divide the relevant region of the PC1-PC2 plane into sections, each of which corresponds to a single machine condition (healthy or a specific fault). A new unlabelled data point is then assigned to the section in which it falls.
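As a rough illustration of how such a decision map can be produced, the sketch below evaluates a trained rule on a regular grid over the PC1-PC2 plane. A nearest-centroid rule is used purely as a stand-in for the classifiers compared in Table 4, and the two-class data, grid limits and function names are hypothetical.

```python
import numpy as np


def nearest_centroid_fit(X, y):
    """Return the class labels and the mean PC1-PC2 position of each class."""
    labels = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in labels])
    return labels, centroids


def nearest_centroid_predict(labels, centroids, points):
    """Assign every point to the class whose centroid is closest."""
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return labels[np.argmin(d, axis=1)]


# Synthetic two-class scores; NOT data from the study.
X = np.array([[-2.0, 0.0], [-1.5, 0.5], [-2.5, -0.5],
              [2.0, 0.0], [1.5, -0.5], [2.5, 0.5]])
y = np.array([0, 0, 0, 1, 1, 1])
labels, centroids = nearest_centroid_fit(X, y)

# Evaluate the rule on a regular grid covering the PC1-PC2 plane.
pc1, pc2 = np.meshgrid(np.linspace(-4, 4, 81), np.linspace(-3, 3, 61))
grid = np.column_stack([pc1.ravel(), pc2.ravel()])
region = nearest_centroid_predict(labels, centroids, grid).reshape(pc1.shape)

# A new unlabelled sample is assigned to the region it falls in.
new_sample = np.array([[1.0, 2.0]])
print(nearest_centroid_predict(labels, centroids, new_sample))
```

The `region` array holds one class label per grid node and could be passed to a filled-contour plotting routine to colour the sections of the plane.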
It is important to highlight that, at all three machine speeds examined, instances of severe over-fitting occurred for both the Naïve Bayes and Gaussian SVM classifiers. The sections corresponding to the Healthy and Mlign cases are very small, which could lead to incorrect classification of future data points. While a similar phenomenon was also evident for linear SVM under the Healthy case at 30 Hz, it demonstrated good outcomes at 20 Hz and 40 Hz. K-NN and the two ANN classifiers displayed very good performance at all speeds. Judging by all-round performance, ANN offered the best results under all scenarios. The ANN architecture adopted here was a two-layer back-propagation (BP) network with sigmoid hidden neurons and softmax output neurons. This architecture can classify vectors arbitrarily well, provided that its hidden layer contains enough neurons. The 2-10-5 and 2-20-5 configurations in Table 4 denote the numbers of inputs, hidden-layer neurons and outputs, respectively, where the two inputs are PC1 and PC2. The dataset was divided into 70% for training, 15% for validation and 15% for testing.
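The network layout described above can be sketched as a forward pass in NumPy. The sketch assumes that the configurations read as input-hidden-output sizes and that the 70/15/15 division is in percent. The weights are random and untrained, back-propagation training is omitted, and all names and data are hypothetical.

```python
import numpy as np


def init_network(n_in=2, n_hidden=10, n_out=5, seed=0):
    """Random (untrained) weights for an n_in-n_hidden-n_out network."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.5, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.5, size=(n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(net, X):
    """Sigmoid hidden layer followed by a softmax output layer."""
    hidden = 1.0 / (1.0 + np.exp(-(X @ net["W1"] + net["b1"])))
    logits = hidden @ net["W2"] + net["b2"]
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)   # class probabilities per sample


def split_70_15_15(n_samples, seed=0):
    """Shuffle sample indices and split them 70/15/15."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(round(0.70 * n_samples))
    n_val = int(round(0.15 * n_samples))
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


X = np.random.default_rng(1).normal(size=(100, 2))   # synthetic PC1, PC2 scores
train, val, test = split_70_15_15(len(X))
for n_hidden in (10, 20):                            # the 2-10-5 and 2-20-5 layouts
    probs = forward(init_network(n_hidden=n_hidden), X[test])
    print(n_hidden, probs.shape, probs.argmax(axis=1)[:5])
```

Each row of `probs` sums to one, and the predicted class is the output neuron with the largest probability.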

Conclusions and Future Work
Vibration-based detection and classification of faults in industrial rotating machines has been widely applied across various industries for several decades. Despite the wealth of knowledge attributable to these classes of approaches, their field-based implementation still consumes a significant amount of time, which consequently impacts overall downtime and organisational profit margins. This is perhaps why recent research efforts have focussed heavily on the development of techniques that are capable of rationalising faults diagnosis data through data fusion. Such initiatives include the development of the coherent composite spectrum (CCS) approach, which uses a single spectrum to describe the dynamics of an entire rotating machine, irrespective of the number of measurement locations. While initial findings from this approach yielded very encouraging outcomes, there still exist significant opportunities for improvement, especially with regard to minimising or eliminating human intervention and subjectivity so as to enhance repeatability.
Building upon the earlier CCS multi-sensor data fusion technique, the current study presents an automatic hybrid approach that uses spectrum energy (SE) as its feature. The proficiency of the new SE feature with various machine learning classifiers under different machine speeds and conditions is presented. This multi-feature/multi-classifier/multi-scenario approach allows for adequate performance comparison. It was observed that classifiers such as Naïve Bayes and Gaussian support vector machine (SVM) displayed instances of severe over-fitting: the sections corresponding to the Healthy and Mlign cases were very small, which could lead to incorrect classification of future data points. While this situation was also apparent for linear SVM under the Healthy case at 30 Hz, its performance at other speeds was quite good. While K-NN classifiers displayed reasonably good performance at all speeds, ANN offered the best all-round results under all scenarios. The findings of the current study demonstrate immense opportunity for automating data fusion-based faults classification of industrial rotating machines. While the current study only examined rotor-related faults, future studies are planned to test the presented approach on other classes of faults, especially high-frequency faults associated with bearings and gears.