Bearing Fault Classification of Induction Motors Using Discrete Wavelet Transform and Ensemble Machine Learning Algorithms

Abstract: Bearing fault diagnosis at an early stage is essential to ensure seamless operation of induction motors in industrial environments. Identifying and classifying faults helps maintenance to be undertaken efficiently. This paper presents an ensemble machine learning-based fault classification scheme for induction motors (IMs) that uses the motor current signal, with the discrete wavelet transform (DWT) for feature extraction. Three wavelets (db4, sym4, and Haar) are used to decompose the current signal, and several features are extracted from the decomposed coefficients. In the pre-processing stage, notch filtering is used to remove the line frequency component to improve classification performance. Finally, two ensemble machine learning (ML) classifiers, random forest (RF) and extreme gradient boosting (XGBoost), are trained and tested using the extracted feature set to classify the bearing fault condition. Both classifier models demonstrate very promising results in terms of accuracy and other accepted performance indicators. Our proposed method achieves an accuracy slightly greater than 99%, which is better than other models examined for the same dataset.


Introduction
Among rotating machinery, induction motors (IMs) are widely used in manufacturing industries such as transportation, petrochemicals, and power systems because of their low cost, high reliability, robust design, and high efficiency under full load. Generally, IMs must remain in operation for long periods, commonly in harsh operating environments and with regular wear, which cause mechanical and electrical stresses. These can lead to unexpected failure of gears and bearings, which are significant machine components of an IM. Such failures can cause financial losses or human casualties. Therefore, machine health monitoring and fault analysis have become an integral part of the maintenance system in an industrial environment. A robust condition monitoring system can decrease maintenance costs, improve productivity, and increase reliability and safety. In recent times, machine learning (ML)-based fault analysis approaches have emerged as powerful and prevalent tools for continuous health monitoring of rotating machinery, as they can extract valuable information from large amounts of historical data [1][2][3].
Rotor faults, stator faults, rolling element bearing (REB) faults, and other faults commonly occur in IMs; among them, bearing faults are the most frequent. Statistics show that 40% of all breakdowns in large machines are due to bearing faults, and the figure is as high as 90% for small machines [4]. The main elements of a bearing are the inner, and outer ring, rolling element, information. Generally, a running machine produces many non-stationary signal components because of environmental changes and faults in the machine itself. Therefore, it is important to evaluate non-stationary signals with time-frequency analyses such as the short-time Fourier transform (STFT), the Wigner-Ville distribution (WVD), and the wavelet transform. These techniques reveal both the time and frequency domain information necessary for investigation.
In machine learning-based fault analysis, signal processing is generally used for signal conditioning and feature extraction and includes cyclic spectral analysis [19], statistical analysis [20], wavelet analysis [21][22][23], the Hilbert-Huang transform [24], and correlation [25]. Because of its distinctive characteristics, wavelet analysis is widely used for analyzing non-stationary signals. It is used for fault diagnosis in gears and bearings [26,27] and for locating faults and determining crack size in different structures and components [28,29]. Many researchers have reported successful use of the wavelet transform to detect and extract features for fault classification [30][31][32]. Various types of faults in the power system [33] have been successfully differentiated by decomposing three-phase voltage signals up to only the 4th level using the DWT. Although many wavelet functions exist, it is crucial to choose a suitable wavelet that best matches the signal and extracts the most suitable features. After the initial signals are transformed into a compact, relevant representation, they act as the input of a classifier to train and improve the decision function. Among the various classification methods, the Support Vector Machine (SVM) and Artificial Neural Network (ANN) are the most commonly implemented for machine fault detection and identification [34][35][36][37]. More recently, the extreme learning machine (ELM), a single-hidden-layer feedforward neural network, has been implemented [38] for fault detection and classification; it offers a very fast learning rate and higher prediction accuracy in comparison with SVM and ANN. Deep learning (DL) approaches have also been considered for fault diagnosis. The DL technique consists of multiple levels of non-linear operations and can automatically learn high-level features to support more intelligent decision-making.
DL methods, such as Deep Belief Networks (DBN), Stacked Auto Encoders (SAE), and Convolutional Neural Networks (CNN), have been investigated recently in fault diagnosis [39][40][41][42]. Although ML techniques provide effective solutions, these methods often become stuck in a local minimum if the configuration parameters are not chosen carefully [43]. In recent times, researchers have started to apply ensemble learning to avoid the drawbacks of ML approaches related to feature selection, incremental learning, class-imbalanced data, and learning concept drift from non-stationary distributions [44]. Ensemble learning is an ML paradigm in which multiple learners are trained to solve a single problem. This approach provides better generalization to unknown cases, better avoidance of local minima, and greater search ability than an ordinary single ML model. Random forests and extreme gradient boosting (XGBoost) are two well-known ensemble ML methods, proposed by Breiman [45] in 2001 and Tianqi Chen [46] in 2014, respectively. The XGBoost algorithm offers low computational complexity, high accuracy, and fast running speed across input data set sizes because it uses multi-threaded parallel computing on the central processing unit (CPU). These methods are effectively used in various fields, such as prediction of environmental conditions, detection of medical symptoms, and diagnosis of machine faults [47][48][49].
In this study, we used three wavelets with the discrete wavelet transform (DWT) for signal decomposition. Statistical features were then calculated from the high-level approximation coefficients and the detail coefficients to reduce the feature matrix dimension. Next, two ensemble learning algorithms, RF and XGBoost, were trained to classify the bearing conditions, and their performance was evaluated. Finally, a comparison is made with some recent works on the motor current signal: Lessmeier et al. [50] applied a particle swarm optimization based support vector machine (SVM-PSO), and the authors of [51,52] used deep learning based approaches for IM fault classification.
Therefore, the main contributions of this paper can be listed as follows:
• Evaluate the performances of various wavelets for motor current signal analysis.
• Observe the effect of denoising the current signal on classification accuracy.
• Evaluate the performance of two ensemble classifiers for bearing fault identification of an IM.
The remainder of this paper is organized as follows. Section 2 provides a description of the experiment set up and characteristics of the data. Section 3 describes the overall process of the method utilized in this paper. Section 4 explains the experimental results and performance of the proposed approach using the evaluation parameters from the dataset, and Section 5 concludes the paper.

Experimental Test Rig and Data Description
The dataset used in this work was collected from the bearing data center administered by the Faculty of Mechanical Engineering, Paderborn University, Germany. The overall data were acquired from 32 experimental bearings, classified into healthy bearings, artificially damaged bearings, and bearings with real damage produced by accelerated lifetime tests. The testbed is shown in Figure 1 and consists of a permanent magnet synchronous motor operated by a frequency inverter with a switching frequency of 16 kHz. Along with motor currents and vibration, supporting measurements such as speed, torque, temperature, and radial load are also available [50].


The test rig was driven under several working conditions to diversify the dataset; these are listed in Table 1. The damage size, location, geometry, and occurrence in the test rig followed the ISO 15243 (2010) standard. In our study, among the 32 bearings, we considered the motor phase currents of two phases (CS1 and CS2) from 17 bearings, which include healthy bearings and those with inner and outer race faults, as summarized in Table 2. The current signal was recorded with a current transducer (LEM CKSR 15-NP). For each setting, there are 20 files, each containing 4 s of data, saved in MAT file format.
Appl. Sci. 2020, 10, 5251
In our analysis, we considered 1 s of the current signal rather than the whole 4 s duration, for both phases. The sampling frequency was 64 kHz; therefore, the initial dataset is a matrix of size 2720 × 64,000. Time domain representations of the current signals recorded from three bearings (a healthy bearing, a bearing with an inner ring fault, and a bearing with an outer ring fault) are provided in Figure 2. The time domain signals show only very subtle differences among the three conditions. The frequency spectrum is presented in Figure 3, which reveals that each signal contains a 60 Hz component. This is the line frequency of the supply. We evaluated the classifier performance with and without this 60 Hz component, as discussed in a later section.
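As a rough illustration of how a line frequency peak of this kind can be located, the sketch below computes a single-sided amplitude spectrum with NumPy. The signal is a synthetic stand-in (a 60 Hz sinusoid plus a small 137 Hz component chosen arbitrarily), not data from the test rig, and the variable names are hypothetical.

```python
import numpy as np

fs = 64_000                     # sampling frequency in Hz, as stated in the text
t = np.arange(fs) / fs          # 1 s of samples

# Synthetic stand-in for one phase current: line component plus a weak extra tone.
current = 1.0 * np.sin(2 * np.pi * 60 * t) + 0.05 * np.sin(2 * np.pi * 137 * t)

# Single-sided amplitude spectrum: scale the FFT magnitude by 2/N.
amplitude = 2.0 * np.abs(np.fft.rfft(current)) / len(current)
freqs = np.fft.rfftfreq(len(current), d=1.0 / fs)

dominant_frequency = freqs[np.argmax(amplitude)]
print(f"dominant component near {dominant_frequency:.1f} Hz")
```

With 64,000 samples at 64 kHz the frequency resolution is 1 Hz, so a 60 Hz component falls on a single bin and no windowing is needed in this idealized case.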

Methodology
The main objective of this work is to explore the appropriateness of ensemble learning to motor bearing fault analysis using the current signal. To create a relevant feature matrix to train the classifiers, the raw current signals are passed through a notch filter and decomposed using three wavelets.
The purpose of using three wavelets is to observe which wavelet decomposition provides a better performance for feature extraction. The feature matrix is constructed by computing 11 features from the wavelet coefficients. Finally, the classifier model is validated by a cross-validation method. Several hyperparameters for each classifier are also determined to ensure optimum performance. A generalized workflow is presented in Figure 4.


Bearing Fault Signatures
The rolling element bearing (REB) is an important component in rotating machines and generally carries heavy loads while being expected to operate with high reliability and efficiency. Usually, the inner ring and outer ring of the bearing are mounted on a rotating shaft and a stationary housing, respectively. The rolling elements, which may be balls, tapered rollers, cylindrical rollers, or barrel rollers, are enclosed in a cage with equal spacing. To allow each rolling element to contact the ring at a single point, the radii of the rolling elements are slightly smaller than that of the track of rotation, so that the load is carried on a very small surface. A general representation of a rolling element bearing and the two types of fault analyzed in this study are shown in Figure 5.

The cage separates the rolling elements to avoid poor lubrication conditions during operation. A large variety of REBs exists, including deep groove ball bearings, which are mostly used in home appliances, industrial equipment, and automotive applications. Various types of damage, such as pitting, spalling, waviness, and misaligned races, generally occur because of abrasive wear, improper installation, material fatigue, manufacturing error, and so on.
Generally, each bearing element has a characteristic frequency. When damage occurs on a bearing element, contact with the defect produces pulses of very short duration, which raise the vibration energy at that particular frequency. The damage frequency can be determined from the bearing geometry and the rotational speed using Equations (1)-(4).
Equations (1)-(4) give the outer race, inner race, ball, and cage defect frequencies, respectively, where $\beta$, $N_{ball}$, $D_{ball}$, $D_{cage}$, and $f_m$ represent the contact angle of the balls, the number of balls or cylindrical rollers, the diameter of the ball, the cage diameter (also known as the roller or ball pitch diameter), and the rotational frequency, respectively. The detailed ball bearing geometry is described in [53].
The characteristic fault frequencies appear in the current signal because bearing damage changes the radial motion between the rotor and stator. Bearing damage produces radial displacement of the rotor relative to the stator, so fluctuations of rotor eccentricity and load torque also occur, which result in amplitude, phase, and frequency modulation of the motor current signals.
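For reference, the characteristic frequencies of Equations (1)-(4) are commonly written in the following form in the motor-current bearing-diagnosis literature. This is a sketch under that assumption, using the symbols defined above; it is not copied from this paper, and some references scale the ball defect frequency by an additional factor of 1/2.

```latex
\begin{align}
f_{\text{outer}} &= \frac{N_{\text{ball}}}{2}\, f_m \left(1 - \frac{D_{\text{ball}}}{D_{\text{cage}}}\cos\beta\right) \tag{1} \\
f_{\text{inner}} &= \frac{N_{\text{ball}}}{2}\, f_m \left(1 + \frac{D_{\text{ball}}}{D_{\text{cage}}}\cos\beta\right) \tag{2} \\
f_{\text{ball}}  &= \frac{D_{\text{cage}}}{D_{\text{ball}}}\, f_m \left(1 - \frac{D_{\text{ball}}^{2}}{D_{\text{cage}}^{2}}\cos^{2}\beta\right) \tag{3} \\
f_{\text{cage}}  &= \frac{1}{2}\, f_m \left(1 - \frac{D_{\text{ball}}}{D_{\text{cage}}}\cos\beta\right) \tag{4}
\end{align}
```

In this form, (1) and (2) sum to $N_{\text{ball}} f_m$ for any contact angle, which offers a quick consistency check on computed values.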
The bearing fault motor current equation is described in [54]; in it, $\phi$ and $\omega_{ck}$ represent the phase angle and angular velocity, respectively, and $f_{bearing}$, $p$, and $f_s$ denote the harmonic frequency of the fault current, the number of pole pairs of the machine, and the fundamental frequency, respectively. The fundamental frequency is the electrical supply frequency, that is, the frequency of the three-phase power supply connected to the stator of the induction motor. The harmonic indices are m = 1, 2, 3, . . ., and $f_v$ can be either $f_{inner}$, the inner race damage frequency, or $f_{outer}$, the outer race damage frequency. However, the noise frequencies and the harmonics produced by the bearing fault can lie very close to or overlap each other, complicating bearing fault detection [55]. In our study, we analyzed three types of bearing conditions: healthy, inner race fault, and outer race fault.
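To make the role of these quantities concrete, the sketch below evaluates the outer and inner race frequencies and the frequencies at which a fault would be expected in the current spectrum. It assumes the commonly cited forms f_outer = (N_ball/2) f_m (1 - (D_ball/D_cage) cos β), f_inner with a plus sign, and the relation f_bearing = |f_s ± m f_v|. Neither these forms nor the bearing dimensions and speed used here are taken from this paper, and the function names are hypothetical.

```python
import math

def race_fault_frequencies(n_ball, d_ball, d_cage, f_m, beta=0.0):
    """Outer and inner race defect frequencies in Hz (assumed textbook form)."""
    ratio = (d_ball / d_cage) * math.cos(beta)
    f_outer = 0.5 * n_ball * f_m * (1.0 - ratio)
    f_inner = 0.5 * n_ball * f_m * (1.0 + ratio)
    return f_outer, f_inner

def current_fault_frequencies(f_s, f_v, max_harmonic=3):
    """Pairs (|f_s - m*f_v|, f_s + m*f_v) for m = 1..max_harmonic."""
    return [(abs(f_s - m * f_v), f_s + m * f_v)
            for m in range(1, max_harmonic + 1)]

# Hypothetical bearing geometry (mm) and shaft speed (25 Hz = 1500 rpm).
f_outer, f_inner = race_fault_frequencies(
    n_ball=8, d_ball=6.75, d_cage=28.55, f_m=25.0
)
# Supply frequency of 60 Hz, matching the line component noted in the text.
sidebands = current_fault_frequencies(f_s=60.0, f_v=f_outer)

print(f"f_outer = {f_outer:.2f} Hz, f_inner = {f_inner:.2f} Hz")
for m, (low, high) in enumerate(sidebands, start=1):
    print(f"m = {m}: {low:.2f} Hz and {high:.2f} Hz")
```

Listing the expected frequencies this way shows why removing the dominant supply component can help: the fault-related components sit on either side of it and are far weaker.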

Wavelet Transform
The Fourier transform extracts frequency information by averaging over the complete length of the signal, so it cannot indicate when a given frequency component occurs, which is a major drawback [56]. To overcome this problem, many time-frequency domain approaches have been used, such as the Gabor transform, short-time Fourier transform (STFT), Wigner-Ville transform, and wavelet transform. Among them, the wavelet transform builds on the Fourier transform and STFT to transform the time domain signal into the time-frequency domain. In the wavelet transform, a signal is represented as a combination of small waves, known as wavelets. A wavelet is a short-duration oscillation that starts and ends at zero. Several orthogonal basis functions are available, known as mother wavelets. A family of wavelets is created from a mother wavelet by scaling (stretching or shrinking) it. Shifting the scaled wavelets along the time axis provides information about the localization of the different frequency contents of the signal. There are two variants of the wavelet transform, namely the continuous wavelet transform (CWT) and the discrete wavelet transform (DWT).
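The CWT is defined in [57] as
$$\mathrm{CWT}(a, \tau) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - \tau}{a}\right) dt$$
where $a$, $\tau$, and $\psi$ represent the scale parameter, translation parameter, and mother wavelet, respectively, $x(t)$ is the signal, and $\psi^{*}$ is the complex conjugate of $\psi$.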
The DWT is derived from discretization of $\mathrm{CWT}(a, \tau)$ with $a = 2^{j}$ and $\tau = k\,2^{j}$ as
$$\mathrm{DWT}(j, k) = \frac{1}{\sqrt{2^{j}}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - k\,2^{j}}{2^{j}}\right) dt$$
Here, j and k can be any positive integer values such as 1, 2, 3, . . . Generally, a multiresolution analysis decomposes the signal into a smoother version of the original signal (approximations) and a set of detailed information at various scales. The high frequency and low frequency components are known as detail coefficients (cD) and approximation coefficients (cA), respectively. This process is performed using a series of high and low pass filters and can be expressed as
$$x(t) = A_{j}(t) + \sum_{i=1}^{j} D_{i}(t)$$
where $A_j$ and $D_j$ represent the low frequency bands (approximations) and high frequency bands (details) of the signal, respectively. In the transient state, the high frequency components are evaluated to analyze the signal. In addition, the DWT exposes aspects of data such as discontinuities in higher derivatives, breakdown points, and self-similarity more efficiently than many other signal processing techniques.
Many mother wavelets exist for both the CWT and the DWT, but not all are equally suitable for a given signal. The nature of the signal (e.g., image, time-series data) and the field of application influence which wavelet should be used. Researchers need to look for the wavelet function that best correlates with the function or signal being analyzed to extract the most effective information. As found in the literature, the Haar, db4, and sym4 wavelets, shown in Figure 6, are widely used for time-series signals [23,58]. Therefore, we explored the effectiveness of these three wavelets in decomposing the motor current signal for fault analysis.
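The filter-bank view of the DWT can be sketched with the Haar wavelet, whose low-pass and high-pass filters reduce to scaled pairwise sums and differences. The pure-Python functions below are illustrative only: their names are hypothetical, and this is not the implementation used in the study. Coefficients are returned in the order [cA_J, cD_J, ..., cD_1].

```python
import math

SCALE = 1.0 / math.sqrt(2.0)

def haar_dwt(signal):
    """One Haar DWT level: low-pass/high-pass filtering plus downsampling by 2."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    approx, detail = [], []
    for a, b in zip(signal[0::2], signal[1::2]):
        approx.append((a + b) * SCALE)   # approximation (low frequency)
        detail.append((a - b) * SCALE)   # detail (high frequency)
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the signal from one level of coefficients."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend(((a + d) * SCALE, (a - d) * SCALE))
    return signal

def haar_wavedec(signal, level):
    """Multilevel decomposition: only the approximation is split again."""
    approx, details = list(signal), []
    for _ in range(level):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    return [approx] + details[::-1]

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
coeffs = haar_wavedec(x, level=3)
for name, c in zip(["cA3", "cD3", "cD2", "cD1"], coeffs):
    print(name, [round(v, 3) for v in c])
```

This sketch requires an even number of samples at every level. Library implementations usually extend the signal at its boundaries so that arbitrary lengths and longer filters such as db4 or sym4 can be handled.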

Feature Extraction
The raw dataset is quite large, which is useful for training a classifier model. However, the dimensionality of the input data should be as small as possible to obtain high classification accuracy with low computational resources. Depending on the application and the amount of available data, feature extraction is sometimes omitted and the data are used directly to train the classifier. In feature extraction, the data dimension is reduced by transforming the initial data into a smaller and more tractable set for further processing. In the raw dataset, because of the different working conditions and machine inputs, the number of variables becomes very large and would demand higher computing resources in the next step. The main objective of feature extraction is to convert the raw data into a smaller subset of significant variables that efficiently represent the target classes. Thus, feature extraction is a crucial step for simplifying calculation and preserving important information for the final decision making.

Before performing feature extraction from the current signals, the 60 Hz line frequency component is removed using a notch filter. The data from the 17 bearings are then merged. The raw and filtered signals for three bearing conditions are shown in Figure 7. It is not easy to differentiate between faulty and healthy signals by visual inspection of the unfiltered signals, but the differences are more discernible in the filtered versions.
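A notch filter of this kind can be sketched as a second-order IIR section (biquad). The coefficients below follow the widely used audio-EQ cookbook form. The quality factor Q = 10 and the synthetic test signal are assumptions for illustration, since the filter design used in the study is not specified here, and the function names are hypothetical.

```python
import math

def notch_coefficients(f0, fs, q):
    """Normalized biquad notch centred at f0 (Hz) for sampling rate fs (Hz)."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def apply_biquad(b, a, x):
    """Direct-form I filtering of the sequence x with zero initial state."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

fs = 64_000
n = fs // 2                                            # 0.5 s of samples
t = [i / fs for i in range(n)]
# Synthetic stand-in: strong 60 Hz line component plus a weak 200 Hz component.
raw = [math.sin(2 * math.pi * 60 * ti) + 0.1 * math.sin(2 * math.pi * 200 * ti)
       for ti in t]

b, a = notch_coefficients(60.0, fs, q=10.0)
filtered = apply_biquad(b, a, raw)

tail_len = fs // 10                                    # last 0.1 s, after the transient
tail_peak = max(abs(v) for v in filtered[-tail_len:])
print(f"peak amplitude after filtering: {tail_peak:.3f}")
```

A larger Q narrows the notch but lengthens the start-up transient. In practice a zero-phase library routine may be preferable; SciPy, for example, provides `scipy.signal.iirnotch` for the design step.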
The filtered signal is decomposed up to 11 levels with each of the Haar, db4, and sym4 wavelets. From the 11-level decomposition, we obtained the detail coefficients from levels 1 to 11 (cD1 to cD11) and one approximation coefficient (cA) for each wavelet. The decomposed signals are presented in Figures 8-10.
Due to the high dimension of data containing an excessive amount of information, wavelet decomposition coefficients cannot be directly used as the input of classifiers. To reduce the data dimension and extract the most important characteristic information, feature extraction methods are applied to the wavelet coefficients at every decomposition level.
In this study, 11 statistical features are extracted from the 11 detail coefficients and the approximation coefficient of each wavelet. The initial dataset has 2720 instances covering three bearing conditions (healthy, inner race fault, and outer race fault). Each instance is a current signal with a duration of 1 s, sampled at 64 kHz. In the feature extraction stage, each instance is decomposed to 11 levels, giving 11 detail coefficients and 1 approximation coefficient (12 coefficients in total). For each coefficient, the 11 features listed in Table 3 are calculated, giving 12 × 11 = 132 features per instance. Finally, after adding the label as the last column, we have a feature matrix of size 2720 × 133.
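The bookkeeping behind the 132-feature vector can be sketched as follows. The 11 statistics used here are common stand-ins chosen for illustration and are not the list in Table 3 (only skewness as the third feature is taken from the text). The coefficient arrays are random placeholders rather than real wavelet coefficients, and all names are hypothetical.

```python
import numpy as np

def skewness(c):
    d = c - np.mean(c)
    return np.mean(d**3) / np.std(c) ** 3

def kurtosis(c):
    d = c - np.mean(c)
    return np.mean(d**4) / np.std(c) ** 4

def rms(c):
    return np.sqrt(np.mean(np.square(c)))

# Eleven illustrative statistics (assumed, not the paper's Table 3).
FEATURES = [
    ("mean", np.mean),
    ("std", np.std),
    ("skewness", skewness),
    ("kurtosis", kurtosis),
    ("rms", rms),
    ("max", np.max),
    ("min", np.min),
    ("peak_to_peak", lambda c: np.max(c) - np.min(c)),
    ("energy", lambda c: np.sum(np.square(c))),
    ("crest_factor", lambda c: np.max(np.abs(c)) / rms(c)),
    ("median", np.median),
]

def feature_vector(coeffs):
    """Feature-major layout: each statistic is evaluated on all 12 coefficients in turn."""
    return np.array([func(c) for _, func in FEATURES for c in coeffs], dtype=float)

rng = np.random.default_rng(0)
lengths = [32] + [32 * 2**i for i in range(11)]   # cA11, cD11, ..., cD1 placeholders
rows = []
for label in range(5):                            # five toy instances
    coeffs = [rng.standard_normal(n) for n in lengths]
    rows.append(np.append(feature_vector(coeffs), label % 3))  # label in last column
feature_matrix = np.vstack(rows)
print(feature_matrix.shape)  # (5, 133)
```

With this layout the third statistic occupies feature indices 25 to 36 (1-based), one entry per coefficient array.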
Table 3. The formulae used for feature extraction from the current signal (x represents the signal vector). [Table body not recoverable; surviving entry: feature 3, skewness, k = 25, …, 36.]

Ensemble Learning
In recent times, selecting an efficient ML algorithm has become crucial for the performance of any fault diagnostic model. Ensemble learning algorithms are one approach that can provide better performance than a single prediction algorithm [59]. In ensemble learning, multiple weak learners work together to produce a better model with higher accuracy. Some ensemble learning algorithms such as XGBoost may achieve higher accuracy than artificial neural networks and degree-day ordinary least squares regression for energy model loss estimation [60]. In our study, two ensemble learning algorithms, random forest and XGBoost, were investigated for fault classification of IMs.

Random Forest
The random forest (RF) algorithm is a collection of decision trees, where each tree is trained individually on an independently drawn random sample of the training data. The input dataset for each tree is sampled separately, and the sampling distribution is the same for all trees. RF not only performs well in classification and regression, but also shows outstanding performance in variable selection. The trees are built from bootstrap subsamples of the dataset, with a random subset of features considered for splitting at every node. Each tree is therefore unique and has low bias when fully grown, and the random selection of feature subsets keeps the correlation between individual trees low. After all the trees are combined, the RF model achieves both low bias and low variance. Bootstrap aggregating (bagging) is designed to increase the stability and accuracy of the individual trees in RF [48]. In a classification problem, the class that receives the majority vote from the trees is selected, whereas in regression models the predicted values from all decision trees are averaged.

RF can also mitigate overfitting, which is one of the main concerns with a single decision tree. Because each tree is trained on a random bootstrap sample with random feature subsets, and the results of many decision trees are aggregated to determine the final output, RF is less prone to overfitting. Hyperparameter tuning further helps RF avoid overfitting and was applied in this study using a grid search technique. The diversity of the trees in the RF is controlled by the number of features considered at each split. A higher number of features produces more highly correlated trees at a higher computational cost, whereas a lower number of features reduces the correlation between trees.

The parameters for implementing the RF are the number of trees, the number of features considered at every split, the maximum depth, and the number of samples at the leaf nodes. Generally, a large number of trees is needed to obtain a stable solution for both classification and regression problems. In the RF model, each tree repeatedly splits a node into two or more child nodes, and majority voting over the trees decides the final output of the model, as illustrated in Figure 11a.
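The interplay of bootstrap sampling, random feature subsets, and majority voting can be illustrated with a deliberately small sketch. It is not the RF configuration used in this study: each "tree" is reduced to a single split (a decision stump), so the random feature subset is drawn once per tree, and the two-class toy data and all function names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature_ids):
    """Pick the (feature, threshold) split with the fewest misclassifications."""
    best = None
    for f in feature_ids:
        values = sorted({row[f] for row in X})
        for a, b in zip(values, values[1:]):
            thr = (a + b) / 2.0                      # midpoint between neighbouring values
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, thr, left_label, right_label)
    if best is None:                                 # no split possible: predict the majority
        majority = Counter(y).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]


def fit_forest(X, y, n_trees=25, max_features=1, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample (with replacement)
        feats = rng.sample(range(d), max_features)   # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over the individual stumps."""
    votes = [left if row[f] <= thr else right for f, thr, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical, clearly separable two-class toy data with two features.
X = ([[0.1 * i, 0.2 * (i % 5)] for i in range(20)]
     + [[5 + 0.1 * i, 5 + 0.2 * (i % 5)] for i in range(20)])
y = [0] * 20 + [1] * 20

forest = fit_forest(X, y)
print(predict(forest, [0.5, 0.3]), predict(forest, [6.0, 5.5]))
```

Raising `max_features` makes the stumps more alike (higher correlation between trees), which mirrors the trade-off described above.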

Extreme Gradient Boosting (XGBoost)
XGBoost is an efficient implementation of the gradient boosting decision tree (GBDT) algorithm. XGBoost uses both the first and second derivatives of the loss function, whereas GBDT uses only the first derivative. Boosting is the process by which an ensemble merges multiple weak learners to build a single strong learner. The algorithm learns sequentially: each new regression tree is fitted to the residuals (errors) of the previous trees, and the newly generated tree is then added to the model to update the residuals. This process is repeated so that performance improves gradually. Each new regression tree is thus maximally correlated with the negative gradient of the loss function, which improves the flexibility of the algorithm and drives the loss function toward convergence. Gradient boosting can be expressed as [47]:

$$\hat{y}_i = \sum_{k=1}^{K} t_k(x_i), \quad t_k \in T$$

where $\hat{y}_i$, $x_i$, and $K$ are the predicted response, the inputs, and the number of functions in the function space $T$, respectively. To ascertain the most appropriate functions, the functions $t_k$ are treated as parameters that are fitted to the data during training and find the corresponding regions automatically. Building on GBDT, the regularization term $\Omega(t_k)$ is added to express the complexity of each tree. Finally, $t_k$ is learned by minimizing the following objective function of the training model:

$$\mathrm{Obj}(\varphi) = L(\varphi) + \sum_{k=1}^{K} \Omega(t_k)$$

where $\varphi$ and $L(\varphi)$ are the model parameters and the differentiable loss function, respectively. The loss function can be either logistic loss or square loss and measures how well the model fits the training set.

Another important feature of XGBoost is its use of OpenMP, a shared-memory multiprocessing application programming interface, which allows all CPU cores to be used efficiently in parallel. Together with declaring independent variables at the start of the training process, this decreases training complexity and computation time. The simpler models favored by XGBoost tend to be more resistant to overfitting. A pictorial representation of the XGBoost algorithm is illustrated in Figure 11b.
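The role of the first and second derivatives can be made concrete with a small sketch of second-order boosting. It is a simplified illustration, not the XGBoost library: every tree is a single split, the regularization term is assumed to be an L2 penalty `lam` on the leaf values only, and the toy data and function names are hypothetical. For logistic loss the first derivative is g = p − y and the second is h = p(1 − p); each leaf takes the Newton value −G/(H + lam), where G and H are the sums of g and h over the samples in the leaf.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, g, h, lam):
    """Choose the split with the largest second-order gain; leaves get -G/(H+lam)."""
    G, H = sum(g), sum(h)
    best_gain, best = 0.0, None
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        for a, b in zip(values, values[1:]):
            thr = (a + b) / 2.0
            GL = sum(gi for row, gi in zip(X, g) if row[f] <= thr)
            HL = sum(hi for row, hi in zip(X, h) if row[f] <= thr)
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam) - G ** 2 / (H + lam))
            if gain > best_gain:
                best_gain, best = gain, (f, thr, -GL / (HL + lam), -GR / (HR + lam))
    return best


def boost(X, y, n_rounds=20, eta=0.3, lam=1.0):
    scores = [0.0] * len(X)                          # raw (pre-sigmoid) predictions
    model = []
    for _ in range(n_rounds):
        p = [sigmoid(s) for s in scores]
        g = [pi - yi for pi, yi in zip(p, y)]        # first derivative of logistic loss
        h = [pi * (1.0 - pi) for pi in p]            # second derivative of logistic loss
        stump = fit_stump(X, g, h, lam)
        if stump is None:                            # no split improves the objective
            break
        f, thr, wl, wr = stump
        model.append((f, thr, eta * wl, eta * wr))   # shrink the leaf values by eta
        scores = [s + eta * (wl if row[f] <= thr else wr) for s, row in zip(scores, X)]
    return model


def predict_proba(model, row):
    return sigmoid(sum(wl if row[f] <= thr else wr for f, thr, wl, wr in model))


# Hypothetical one-feature toy data: class 1 for larger feature values.
X = [[0.2], [0.4], [0.6], [1.2], [1.4], [1.6]]
y = [0, 0, 0, 1, 1, 1]
model = boost(X, y)
print(predict_proba(model, [0.2]), predict_proba(model, [1.4]))
```

Each round adds one tree that corrects the remaining error of the current ensemble, which is the sequential learning process described in this section.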

Model Evaluation
The extracted features are fed into the two ensemble classifiers, and the performance of the classifiers is evaluated using the metrics listed in Table 4. Here, true positives (TP) are faulty data points that are correctly labeled as faults, whereas false positives (FP) are normal data points that are wrongly labeled as faults. Similarly, true negatives (TN) are normal data points that are correctly labeled as normal, and false negatives (FN) are faulty data points that are mistakenly labeled as normal.

Specificity indicates the rate of correct detection of the true negative class, whereas sensitivity measures how effectively the model detects events in the positive class. Precision reflects type I errors (false positives) and sensitivity reflects type II errors (false negatives), so the two are typically stated pairwise. Finally, the F1 score is the harmonic mean of precision and sensitivity. In addition, the receiver operating characteristic (ROC) curve is presented to evaluate classifier performance.
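As a hedged illustration of how such metrics follow from TP, FP, TN, and FN, the sketch below uses the standard textbook definitions, which are assumed (not confirmed) to match those of Table 4. The 3 × 3 confusion matrix is invented for the example and is not a result of this study; a one-vs-rest reduction is used because the dataset has three classes.

```python
def one_vs_rest_counts(cm, k):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                        # class k samples predicted as another class
    fp = sum(row[k] for row in cm) - tp         # other classes predicted as class k
    tn = sum(map(sum, cm)) - tp - fn - fp
    return tp, fp, tn, fn


def safe_div(num, den):
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)         # also called recall
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical confusion matrix (rows: true class, columns: predicted class).
cm = [[50, 2, 1],
      [3, 45, 2],
      [0, 4, 46]]

counts = one_vs_rest_counts(cm, 0)
metrics = binary_metrics(*counts)
print(counts)
print({name: round(value, 4) for name, value in metrics.items()})
```

Repeating the calculation for each class index gives per-class values that can then be averaged.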

Hyperparameter Selection
The optimal hyperparameters are determined through a process called hyperparameter search. Appropriate hyperparameter values increase the accuracy of the trained model. To find the optimal set of hyperparameters, each candidate set is evaluated by k-fold cross-validation, and the most appropriate set is determined using "GridSearchCV", which is a scikit-learn class. For both classifiers, a broad range of parameters was tested. The optimum parameters found by the grid search for the two ensemble learning algorithms are listed in Table 5. We applied 5-fold cross-validation to enhance the reliability of the output. Finally, with the optimum parameters of the classifiers, the feature matrix was split into training and testing subsets with an 80:20 ratio to train and validate the fault classification abilities of the RF and XGBoost classification algorithms.
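The logic of an exhaustive grid search scored by 5-fold cross-validation can be sketched in plain Python as follows. This is an illustration of the procedure only: a k-nearest-neighbour classifier stands in for RF and XGBoost to keep the block dependency-free, the three-class data are synthetic, the grid values are arbitrary, and all function names are hypothetical. The sketch also holds out the 20% test portion before the search, which is one common ordering and an assumption here.

```python
import itertools
import random
from collections import Counter


def knn_predict(train_X, train_y, row, k, p):
    """Stand-in classifier: k nearest neighbours, Minkowski order p (root omitted, same ranking)."""
    dist = sorted((sum(abs(a - b) ** p for a, b in zip(x, row)), lab)
                  for x, lab in zip(train_X, train_y))
    return Counter(lab for _, lab in dist[:k]).most_common(1)[0][0]


def k_fold_indices(n, n_folds, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def cv_accuracy(X, y, params, n_folds=5):
    """Mean held-out accuracy over the folds for one hyperparameter setting."""
    scores = []
    for held_out in k_fold_indices(len(X), n_folds):
        held = set(held_out)
        tr_X = [X[i] for i in range(len(X)) if i not in held]
        tr_y = [y[i] for i in range(len(X)) if i not in held]
        hits = sum(knn_predict(tr_X, tr_y, X[i], **params) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / n_folds


def grid_search(X, y, grid, n_folds=5):
    """Evaluate every combination in the grid and return (best score, best params)."""
    names = sorted(grid)
    candidates = [dict(zip(names, combo))
                  for combo in itertools.product(*(grid[name] for name in names))]
    scored = [(cv_accuracy(X, y, c, n_folds), c) for c in candidates]
    return max(scored, key=lambda item: item[0])


# Synthetic three-class data (hypothetical stand-in for the 2720 x 132 feature matrix).
rng = random.Random(1)
centres = {0: (0.0, 0.0), 1: (5.0, 5.0), 2: (10.0, 0.0)}
data = [([rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)], lab)
        for lab, (cx, cy) in centres.items() for _ in range(30)]
rng.shuffle(data)
split = len(data) * 4 // 5                       # 80:20 train/test split
train, test = data[:split], data[split:]
X_train, y_train = [x for x, _ in train], [lab for _, lab in train]

best_score, best_params = grid_search(X_train, y_train, {"k": [1, 3, 5], "p": [1, 2]})
test_acc = sum(knn_predict(X_train, y_train, x, **best_params) == lab
               for x, lab in test) / len(test)
print(best_params, best_score, test_acc)
```

The scikit-learn class named in the text performs the same enumerate, cross-validate, and select steps for any estimator and parameter grid.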

Results
After selecting the hyperparameter values, the two classifiers were trained. The training and test ratio was 80:20. A 5-fold cross-validation approach was used to validate the trained models. The overall process was first carried out with the non-filtered data, and the metrics defined in Equations (23)-(27) were determined, as provided in Table 6. In the next step, the signal was filtered with a notch filter and the process was repeated. With the filtered signal, greater than 99% accuracy was achieved for both classifiers. The performance evaluation metrics for the filtered signal are listed in Table 7 for the two ensemble classifiers. The bar chart in Figure 12 shows the improvement in accuracy after filtering the current signal.

Figure 12. Accuracy for RF and XGB for raw and filtered motor current signals.
The corresponding confusion matrices are presented in Figures 13-15. From the confusion matrices, it is evident that the classifiers classify the faults very successfully, as the numbers of false positives and false negatives are negligible. The ROC curves and the area under the curve, provided in Figure 16, confirm that the models can distinguish among the classes.

Finally, Table 8 presents a comparison with other models investigated on the same dataset in different works. Wavelet packet decomposition up to three levels, along with a special SVM approach named SVM-PSO (SVM-particle swarm optimization), was implemented by Lessmeier et al. [50] and achieved an accuracy of 86.03%. Information fusion (IF) and DL approaches were applied to the motor current signal in [51], resulting in almost 98.3% accuracy. Another study on the same dataset applied an empirical wavelet transform and CNN for fault classification, showing 97.37% accuracy [52]. Therefore, compared with recent research on the motor current signal, our approach provides a better result, with greater than 99% accuracy for the ensemble classifiers.

Table 8. Comparison of classification accuracy among the various methods.

Conclusions
In recent times, the complexity of modern industrial systems has continued to increase because multi-sensor networks have become an essential component of comprehensive systems. Electrical current analysis has emerged as an intelligent solution that simplifies the fault diagnosis process with a small number of sensors. The cost is therefore reduced through sensorless and extensive technologies. In this study, a data-driven approach using a motor current signal analyzed by DWT and ensemble machine learning methods was proposed for IM bearing fault diagnosis. Three mother wavelets (db4, sym4, and Haar) were applied for signal decomposition. Both raw and filtered signals were used separately as the input of the wavelet decomposition. Although we performed a high-level decomposition, the properties of the high and low frequency components showed equal importance. The feature set was constructed from the detail and approximation coefficients and acts as the input of the two ensemble classifiers. Here, two phase currents for three conditions of the IM bearing were considered. The purpose of this study was not only to attain high accuracy, but also to reduce power and computational complexity by eliminating redundant data. Both the RF and XGBoost classifiers achieved greater than 99% accuracy for all three wavelets on the filtered signal, and the other evaluation metrics were similarly strong. The time-frequency domain-based feature extraction was efficient, achieving very high accuracy without feature selection. Finally, a comparison with some recent works is presented and indicates that wavelet decomposition techniques combined with ML ensemble classifier algorithms can be a promising model for bearing fault classification.