Condition Monitoring of Bearing Faults Using the Stator Current and Shrinkage Methods

Abstract: Condition monitoring of bearings is an open issue. The use of the stator current to monitor induction motors has been validated as a very advantageous and practical way to detect several types of faults. Nevertheless, for bearing faults, the use of vibrations or sound generally offers better detection accuracy, although with some disadvantages related to the sensors used for monitoring. To improve the performance of bearing monitoring, it is proposed to take advantage of more of the information available in the current spectra, beyond that usually employed, incorporating the amplitudes of a significant number of sidebands around the first eleven harmonics, which exponentially grows the number of fault signatures. This is especially interesting for inverter-fed motors. In turn, however, this leads to the problem of overfitting when applying a classifier to perform the fault diagnosis. To overcome this problem, and still exploit all the useful information available in the spectra, it is proposed to use shrinkage methods, which have lately been proposed in machine learning to solve the overfitting issue when a problem has many more variables than examples to classify. A case study with a motor is shown to prove the validity of the proposal.


Introduction
Induction motors are a fundamental part of many production processes due to their inherent robustness, low cost, and reliability, among other advantages. However, they are not fault-free, with bearings being the component that accounts for the greatest percentage of total failures [1].
The signals that are most frequently used for bearing fault detection are vibration and acoustic noise [2]. However, the use of the stator current to monitor the motor provides some practical advantages related to the simplicity and noninvasive character of the sensors. These advantages are especially relevant in industrial facilities, where several motors can run simultaneously [3,4]. The use of current has proven its effectiveness in detecting faults such as broken bars and eccentricity [2], but in the case of faulty bearings, it faces technical difficulties that hinder its successful implementation. Mainly, the low energy of the vibrations associated with the fault makes it difficult to distinguish, in the current spectrum, the fault-related frequency components, which may be buried in the noise [1,5,6]. Besides, for inverter-fed motors, the noise is higher and other harmonics are present in the spectrum, which further complicates the detection of the fault-related components [7]. Consistently, in [8], denoising techniques are applied to highlight the fault components in the current spectrum. Other advanced spectral techniques have also been proposed, such as wavelets [9,10], the Short-Time Fourier Transform [11], the Gabor spectrogram [11], the Hilbert-Huang Transform [12,13], Empirical Mode Decomposition [14], Ensemble Empirical Mode Decomposition [15], and the Modulation Signal Bispectrum [16], among others.

Fault Signatures
When a bearing defect appears, a radial motion between rotor and stator will occur, modifying the airgap of the motor and thus changing the airgap field. These modifications in the airgap can be interpreted as a combination of bidirectional rotating eccentricities [41], which implies that the defect affects the stator current and, therefore, it is possible to monitor it in the current spectra. The radial motion generates harmonics in the stator current at frequencies given by Equation (1), where f_1 is the main supply frequency, n is an integer, and f_v is the vibration characteristic frequency. f_v depends on the type of bearing fault (outer race, inner race, ball, or train defect), with expressions that are a function of the geometry and composition of the bearing [7]. The fault frequencies given by Equation (1) are the result of taking into account the deviations in the main component of the airgap field. When the motor is fed by a power converter, the harmonic level is increased and so is the noise level, hampering the detection of the fault signatures. Nevertheless, the presence of these harmonics can be used to increase the available information, also considering the deviations produced in the fields as a consequence of these harmonics. Even when the motor is directly fed from the line, since the supply is hardly ever perfectly sinusoidal, the number of fault signatures can be increased too, considering the harmonics introduced by the supply. Therefore, Equation (1) can be generalized by Equation (2), where k is the order of the current harmonic.
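Since Equations (1) and (2) are referenced but not reproduced above, the following is a hedged reconstruction, consistent with the definitions just given and with the standard form of the bearing-fault frequencies reported in the literature:

```latex
% Reconstruction (standard literature form) of the referenced equations:
% f_1 = supply frequency, f_v = characteristic vibration frequency,
% n = integer sideband order, k = current harmonic order.
f_{\mathrm{bng}} = \left| f_{1} \pm n\, f_{v} \right| \qquad (1)

f_{\mathrm{bng}} = \left| k\, f_{1} \pm n\, f_{v} \right| \qquad (2)
```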
Considering Equation (2), the number of fault signatures can be increased, resulting in a smaller or larger number of variables depending on the value of k and n. In [7,38], the first sideband around the 5th and 7th harmonics is employed. In this paper, it is proposed to use a larger number of signatures, considering more harmonics, and other sidebands in addition to the first one. In the case study of Section 4, the first eleven current harmonics are used to feed the classifier, considering the first eleven sidebands around those harmonics. As each sideband is composed of two values, and eleven sidebands around eleven harmonics are considered, there are 242 signatures for each characteristic bearing fault frequency, resulting in 968 signatures. Table 1 summarizes the information regarding the proposed fault signatures and the comparison with the traditional approach.
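The signature count above can be checked by enumerating the candidate frequencies directly. The sketch below (a hypothetical helper, not the authors' code) generates the frequencies |k·f1 ± n·fv| for eleven harmonics, eleven sidebands, and the four characteristic bearing frequencies; the characteristic frequencies used here are placeholders, not those of the actual bearing tested:

```python
# Enumerate the proposed fault-signature frequencies: eleven sidebands
# (n = 1..11) around the first eleven current harmonics (k = 1..11),
# two spectral amplitudes (+/-) per sideband, for each of the four
# characteristic bearing fault frequencies (outer race, inner race,
# ball, and train/cage).
def signature_frequencies(f1, fault_freqs, n_harmonics=11, n_sidebands=11):
    """Return the list of |k*f1 +/- n*fv| frequencies to read from the spectrum."""
    freqs = []
    for fv in fault_freqs:
        for k in range(1, n_harmonics + 1):
            for n in range(1, n_sidebands + 1):
                freqs.append(abs(k * f1 + n * fv))
                freqs.append(abs(k * f1 - n * fv))
    return freqs

# Placeholder characteristic frequencies for a generic bearing at 50 Hz supply.
sigs = signature_frequencies(50.0, [107.3, 162.7, 141.2, 11.9])
print(len(sigs))  # 11 harmonics * 11 sidebands * 2 values * 4 frequencies = 968
```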

Diagnosis
The next step after selecting the candidate fault signatures is to choose and train the classifier. Many classification algorithms are available, with a wide variety of them already proposed to perform diagnosis tasks in induction motors. With the purpose of analyzing the improvement in the performance of the classifier when using the fault signatures presented in the previous section, the MATLAB 2019a (MathWorks, Natick, MA, USA) Classification Learner App has been used. In this app, different types of classifiers are available: decision trees, discriminant analysis, logistic regression classifiers, Naïve Bayes classifiers, support vector machines, nearest neighbor classifiers, and ensemble classifiers, with several classifiers in each group. Using these classifiers, the large increase in performance obtained when using the 968 fault signatures, instead of the usual eight, has been proved for all of them (as shown in the results section).
However, when using such a high number of signatures with a reduced number of tests, the risk of overfitting is high. Shrinkage techniques make it possible to use all the predictors by shrinking the coefficients towards zero, hence reducing variance [42]. When applied to linear models (which have the advantage of interpretability), the approach works as follows: let x_i be the m predictors (or fault signatures, in the context of condition monitoring) and y_i the response for the n cases of the problem. A linear model estimates the m + 1 coefficients (b_0, ..., b_m). Using a least-squares fitting approach, the b_i are selected to minimize (a).
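The least-squares objective referred to as (a) is not reproduced in the text; written with the symbols just defined, its standard form is:

```latex
% Reconstruction of the residual sum of squares minimized in (a),
% for n cases and m predictors x_{ij} with responses y_i.
\mathrm{RSS} \;=\; \sum_{i=1}^{n} \Bigl( y_i - b_0 - \sum_{j=1}^{m} b_j\, x_{ij} \Bigr)^{2}
```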
To perform the shrinkage, a second term, $\lambda \sum_{j=1}^{m} b_j^2$, which acts as a shrinkage penalty, is added to (a). Its influence depends on the value of λ, a tuning parameter that increases or decreases the penalty. For higher values of λ, the penalty grows and the estimated coefficients tend to zero, which implies that the estimation is penalized, sacrificing some performance on the training set with the aim of improving the predictive capacity on future observations. The penalty applies to all the coefficients except the intercept, b_0, since this term is just an estimate of the mean when the predictors are zero [42].
This way of applying the penalty, i.e., performing the shrinkage on the estimated coefficients, is known as Ridge Regression. It has the disadvantage that the shrinkage is applied to all the coefficients but none of them is set to zero, so all the predictors are included in the solution, which, for problems with a large number of predictors (as in the problem dealt with in this paper), leads to a loss of interpretability of the model. A way of tackling this problem is to change the penalty term to $\lambda \sum_{j=1}^{m} |b_j|$, or, in statistical terms, to replace the l2 penalty with an l1 one [42]. The use of an l1 norm has the inconvenience of making the function to minimize nondifferentiable, although methods are available to carry out the minimization, such as proximal gradient methods [43]. This form of the penalty gives rise to the method known as Lasso. As opposed to Ridge Regression, with Lasso some coefficients are canceled, so it performs variable selection, with the number of selected variables depending on the value of λ (as λ grows, fewer variables are selected).
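The variable-selection effect of the l1 penalty can be illustrated with a minimal proximal gradient (ISTA) solver, the kind of method cited above; this is an illustrative sketch on synthetic data, not the authors' implementation:

```python
import numpy as np

# Minimal ISTA sketch for the Lasso problem
#   min_b ||y - X b||^2 / (2n) + lam * ||b||_1
# A gradient step on the smooth term is followed by the proximal
# operator of the l1 penalty (soft-thresholding), which sets small
# coefficients exactly to zero -- i.e., it selects variables.
def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    n = X.shape[0]
    step = n / np.linalg.norm(X, 2) ** 2      # 1/L, L = Lipschitz constant
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n          # gradient of the smooth term
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=40)  # only predictor 0 is relevant
b = lasso_ista(X, y, lam=1.0)
print(int(np.sum(b != 0)))  # with a strong penalty, most coefficients are exactly 0
```

With a large enough penalty, only the truly informative predictor keeps a nonzero (shrunken) coefficient, mirroring how Lasso discards uninformative fault signatures.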
Lasso was first applied to linear regression and has lately received much attention, being proposed to regularize a wide variety of statistical models [44]. In accordance with Occam's razor, simpler models are preferable, as long as they predict the training data well, since they are more likely to generalize to unseen data [45]. With this principle in mind, Logistic Regression has been chosen as the base model to which the shrinkage technique is applied. Logistic regression is suited to classification problems since it has a discrete outcome. It is based on the logistic function given by Equation (4), which is suitable for classification since its output runs between 0 and 1 and can therefore be interpreted as a probability, and its elongated S-shape offers the advantage that the same additional input influences the outcome less for values near zero or one [46,47]. For binary classification, a threshold value of 0.5 is defined to assign the outcome to one class or the other, which in condition monitoring would be healthy or faulty. When the aim is to distinguish among different states of failure, there are several classes into which the outcome can be classified. This multiclass classification is performed via the one-versus-all approach, as represented in the flow chart in Figure 1. Several binary classifiers are trained (as many as there are classes), where each classifier confronts one class against the rest. Finally, the outcome is assigned to the class with the highest probability.
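The one-versus-all scheme of Figure 1 can be sketched as follows; this is an illustrative toy implementation (unpenalized, with made-up 2-D data standing in for fault signatures), not the authors' code:

```python
import numpy as np

# One-versus-all logistic classification: one binary logistic model per
# class; a new example is assigned to the class whose model outputs the
# highest probability.
def logistic(z):
    """Logistic (sigmoid) function of Equation (4): maps R to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y01, lr=0.5, n_iter=2000):
    """Plain gradient descent on the logistic loss (no penalty, for brevity)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = logistic(X @ w)
        w -= lr * X.T @ (p - y01) / len(y01)
    return w

def one_vs_all(X, y, classes):
    """One model per class: the positive class against all the others."""
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def predict(models, x):
    probs = {c: logistic(x @ w) for c, w in models.items()}
    return max(probs, key=probs.get)  # class with the highest probability

# Toy data: two 2-D clusters standing in for two bearing conditions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
models = one_vs_all(X, y, classes=[0, 1])
print(predict(models, np.array([2.0, 2.0])))
```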

Test Bench
The tested induction motor is a two-pole-pair squirrel cage motor, star connected, with a rated power of 0.75 kW at 400 V and a rated current of 1.9 A at a rated speed of 1395 RPM. The tests were performed at two load levels, low (almost no load) and high (rated speed), using a magnetic powder brake. The data were collected using a National Instruments PCI-6250 DAQ card (16 analogue inputs, 16 bit, 1 MS/s) and LEM Hall-effect current sensors (LEM, Fribourg, Switzerland). The sampling frequency was 25 kHz, with a sampling time of 10 s (steady state).
Four different supply conditions were considered (Table 2). The first one (S1) represents the motor directly fed from a 400 V utility supply. Supply S2 is the motor fed by an inverter (ABB) at 50 Hz with a switching frequency of 4 kHz. For S3, the operating frequency was changed to 25 Hz, and for S4 the switching frequency was set to 5 kHz. The converter operated with an open-loop scalar V/Hz control. To initiate the tests, a new SKF Explorer 6004 bearing (SKF, Göteborg, Sweden) was used, performing the corresponding tests to represent the healthy condition. Then, to provoke the progressive wear of the bearing, the lubricant grease was contaminated with silicon carbide, a ceramic material with high resistance to erosion, corrosion, and thermal cycling. This process was established to simulate industrial conditions that lead to the degradation of the bearing, such as inadequate lubrication, overloads, or lubricant contamination (especially relevant for open bearings). During this process, five condition states were defined according to the degradation of the bearing, as summarized in Table 3. After assembling the new bearing, 20 tests per supply were run, corresponding to the healthy state (C1). Then, the bearing was first contaminated, and the motor was run unloaded for 12 h to bring the bearing to the "incipient fault" condition (C2). In this condition, 15 tests were performed for each supply. The process of running the motor unloaded and contaminating the grease was repeated, giving way to the "intermediate fault" condition (C3), with 15 tests per supply, the "developed fault" condition (C4), with 10 tests, and "complete breakdown" (C5), with 10 tests for each supply. Figure 2 presents pictures of the bearing in each of the conditions, showing the evolution of the fault along the tests.
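With these acquisition settings (25 kHz sampling for 10 s of steady state), the spectrum has a resolution of 0.1 Hz, fine enough to resolve the sideband components. A minimal sketch of how a signature amplitude could be read from such a record (using a synthetic current, not measured data):

```python
import numpy as np

# Read a fault-signature amplitude from the current spectrum under the
# acquisition settings described above. The synthetic stator current is
# a 50 Hz fundamental plus one small sideband at an arbitrary frequency.
fs, T = 25_000, 10.0                      # sampling frequency [Hz], record [s]
t = np.arange(int(fs * T)) / fs
df = 1.0 / T                              # frequency resolution: 0.1 Hz
i_stator = np.sin(2 * np.pi * 50 * t) + 0.01 * np.sin(2 * np.pi * 157.3 * t)

# Single-sided amplitude spectrum.
spectrum = np.abs(np.fft.rfft(i_stator)) * 2 / len(i_stator)
freqs = np.fft.rfftfreq(len(i_stator), d=1 / fs)

def amplitude_at(f):
    """Amplitude of the spectral bin closest to frequency f [Hz]."""
    return spectrum[np.argmin(np.abs(freqs - f))]

print(amplitude_at(50.0), amplitude_at(157.3))
```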

Classification with 968 Fault Signatures
In order to show the improvement in the classification when using the whole set of fault signatures, as proposed in Section 2, the results obtained using the MATLAB 2019a Classification Learner App are presented next. Five-fold cross validation was used. Table 4 summarizes the results obtained with the App, with the accuracy of the classification into each of the five bearing conditions at low and high load. All the algorithms included in the App have been tested, showing the one that has the best performance for each tested case (depending on the load and the supply) and its accuracy. The same procedure has been applied feeding the algorithms with eight inputs, following the traditional approach of considering just the first sideband around the vibration characteristic frequencies, according to Equation (2). According to the results shown in Table 4, it is clear that the use of fault signatures related to a larger number of sidebands around more harmonics outperformed the use of just eight fault signatures. The large improvement in performance was observable in all cases, for the different supplies, operating frequencies, switching frequencies, and loads. The results were in general better at high load, since the energy associated with the harmonics was higher. It is also remarkable that a variety of algorithms was selected, with Support Vector Machines being the most frequently chosen, although in some cases Gaussian Naive Bayes, Linear Discriminant, Fine k-Nearest Neighbor, and Bagged Trees performed better. This discrepancy added difficulty to the selection of a classifier valid for all the operating conditions; an algorithm that performs well in all cases would be desirable. Besides, as stated earlier, the use of a number of signatures much larger than the number of tests may lead to overfitting, so the trained algorithms lose the ability to generalize when classifying new observations.
To take this situation into account, shrinkage was applied, as explained in Section 3.

Classification with 968 Fault Signatures by Applying Shrinkage
The previous section has shown that the classification improves greatly when more of the information available in the spectra is considered. In this section, shrinkage methods are applied, with the double purpose of selecting an algorithm with good performance independently of the operating conditions and of avoiding the problem of overfitting (prone to appear due to the high number of fault signatures, much higher than the number of tests). As explained in Section 3, two different types of shrinkage methods were considered: Ridge regression, in which all the inputs are considered in the classification, and Lasso, which performs variable selection (considering a higher or lower number of inputs, depending on the value of the penalty parameter). A third method, Elastic nets, which can be considered a combination of Lasso and Ridge regression, was included in the comparison.
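The three penalty terms being compared can be summarized in their standard form (α ∈ [0, 1] is the mixing parameter of the Elastic net, a symbol not used in the text above):

```latex
% Standard formulations of the three shrinkage penalties:
P_{\mathrm{ridge}}(b) = \lambda \sum_{j=1}^{m} b_j^{2}, \qquad
P_{\mathrm{lasso}}(b) = \lambda \sum_{j=1}^{m} \lvert b_j \rvert, \qquad
P_{\mathrm{enet}}(b)  = \lambda \sum_{j=1}^{m} \bigl( \alpha \lvert b_j \rvert + (1-\alpha)\, b_j^{2} \bigr)
```

With α = 1 the Elastic net reduces to Lasso, and with α = 0 to Ridge regression, which is why it behaves as an intermediate method.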
To build the algorithms and measure their performance, the data sets for each case study were divided into two sets: a training set consisting of 70% of the cases and a test set with the remaining 30%. Table 5 shows the performance, measured in terms of accuracy, of the three shrinkage methods for each supply and load. It can be observed that the results were very good in all cases, although with some differences in performance among the cases analyzed, as also happened for the algorithms considered in the previous section. The three shrinkage methods performed well, although Lasso generally obtained the best accuracy; therefore, if a single method were to be selected, Lasso would be the candidate. This selection also takes into account that, since Lasso eliminates variables from the classifier, the resulting model gains in interpretability and computational cost. For that reason, a deeper analysis of the performance of Lasso is presented next, paying special attention to the influence of the penalty parameter, since the number of variables selected (and, consequently, the characteristics of the model) depends on this parameter.
As stated in Section 3, the main way in which Lasso avoids overfitting is by feature selection, which is controlled by adjusting the regularization parameter λ. The larger λ is, the more parameters b_j in Equation (3) are set to zero, that is, the corresponding predictors are not considered when designing the classifier. Therefore, choosing a high value of λ makes overfitting much less likely and, besides, greatly reduces the computational cost. The drawback is that, if fewer predictors are considered, the performance of the classifier may be reduced. Therefore, a trade-off must be reached when selecting the regularization parameter, so as to obtain good classifier performance at a lower computational cost.
The value of the regularization parameter was selected using the training set. No additional validation set was used, since the number of tests per case study was low and a three-way split would have left very few data in each of the training, validation, and test sets. Figure 3 shows the evolution of the accuracy on the test set as a function of the regularization parameter, and Table 6 shows the selected value for each supply and load condition. The value chosen for the classifier was the highest one that achieved the best accuracy for that supply and load, since smaller values of λ imply a higher computational cost. If a unique value were to be chosen for all the supplies, 0.05 could be selected when operating at high load, and 0.02 at low load. If the common value of 0.02 were chosen, the performance of the algorithm in this case would decrease by around 5%, although the computational cost would also decrease. The number of variables selected varied approximately between 10 and 20 for the different cases tested, decreasing as λ increased. This selection is the main difference with respect to the other two shrinkage methods: for Ridge regression, since no selection is performed, the number of variables remained at the starting 968, while for Elastic nets the number was approximately intermediate between Lasso and Ridge regression.
So far, accuracy has been used to measure the performance of the classifier. Obviously, from an algorithmic point of view, it is important to classify all the states correctly, and therefore to achieve the highest possible accuracy.
However, from a condition monitoring point of view, some misclassifications are more relevant than others, it being especially important to predict the first and fifth states, that is, the healthy and completely faulty conditions, correctly. With this purpose, Tables 7-10 show the confusion matrices resulting from applying the Lasso classifier for the four supplies and two load conditions. It can be observed that, of the 48 healthy instances (there were six true healthy states in each of the eight cases), only one (S3, low load) was misclassified. Even though this case can be considered a false negative, the instance was classified as an incipient fault, not as a more developed one. In the same way, of the 24 complete-breakdown cases (three for each of the eight cases), only one was misclassified (again, S3 at low load), being predicted as an intermediate fault. Finally, it is relevant to point out that, of all 168 cases classified, 13 were not correctly classified, but just four of them were classified more than one class away from the true class.

Table 7. Confusion matrices for supply S1 applying the Lasso classifier (predicted class vs. true class, at low and high load).

Table 9. Confusion matrices for supply S3 applying the Lasso classifier (predicted class vs. true class, at low and high load).
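The per-class analysis above rests on confusion matrices, which can be computed as in the following sketch; the counts used here are invented for illustration, not the results of Tables 7-10:

```python
import numpy as np

# Confusion matrix for the five bearing conditions C1..C5:
# rows are true classes, columns are predicted classes.
def confusion_matrix(y_true, y_pred, n_classes=5):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1, 2, 3, 4, 4]
y_pred = [0, 0, 1, 2, 2, 3, 4, 4]   # one incipient fault predicted as intermediate
cm = confusion_matrix(y_true, y_pred)
print(cm.trace(), "of", cm.sum(), "correctly classified")
```

The diagonal collects the correct classifications, while off-diagonal entries far from the diagonal correspond to the more serious errors (e.g., healthy predicted as complete breakdown).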

Discussion
A procedure for the diagnosis of induction motor bearings has been presented. The main purpose of the proposal is to match the good performance of existing methods that use vibrations or sound as inputs, while using the stator current instead. So far, current monitoring has not achieved performance as good as that obtained with the other variables mentioned, but since it has clear advantages related to the required sensors, a procedure that allows the use of the current is desirable. To achieve this goal, it has been proposed to take advantage of more of the information that can be extracted from the spectra, beyond what is commonly used, but without the extra computational cost that other techniques, including parametric and non-parametric methods, usually require. Besides, the proposed method is particularly adequate for inverter-fed induction motors, where noisier spectra and significant harmonics and interharmonics are present.
It has been shown that the use of much more information greatly improved the performance of the diagnosis, which has been proved by means of 24 classifiers (available in the MATLAB Classification Learner app). However, it must be taken into account that detection and diagnosis are interlinked. There is no use in expecting good diagnosis performance if the fault signatures obtained during the detection process are of bad quality. Conversely, even with highly informative fault signatures, if the diagnosis stage is badly designed, the whole process will suffer. Besides, the chosen algorithm must be in accordance with the available variables. Therefore, a type of classifier has been selected that can perform well under the particular conditions of the problem, where there are many more fault signatures than cases to classify. Shrinkage methods have been chosen since they can operate under those conditions while avoiding the problem of overfitting.
Three shrinkage methods have been compared, Lasso, Ridge regression, and Elastic nets, and all of them achieved very good performance in the cases analyzed. Although all three meet the expectations, Lasso has been chosen for a deeper analysis of its results, since this method selects variables, providing simpler and more interpretable models. For the analysis of the performance of Lasso, the confusion matrices for eight different scenarios have been provided and analyzed. Although from an algorithmic point of view it is important to classify all the states correctly, from a maintenance perspective the presence of false positives or false negatives concerning the healthy and complete-fault conditions is especially relevant. That is, some misclassifications are more relevant than others. For example, wrong predictions between intermediate and incipient fault conditions are not likely to have important repercussions but, on the contrary, a misclassification between the healthy and complete-fault states will surely have further implications. It has been shown that the predictions obtained with the proposed method matched the expectations from a condition monitoring perspective.