Article

An Oversampling Method of Unbalanced Data for Mechanical Fault Diagnosis Based on MeanRadius-SMOTE

School of Mechanical Engineering, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Sensors 2022, 22(14), 5166; https://doi.org/10.3390/s22145166
Submission received: 13 June 2022 / Revised: 26 June 2022 / Accepted: 8 July 2022 / Published: 10 July 2022

Abstract

With the development of machine learning, data-driven mechanical fault diagnosis methods have been widely used in the field of PHM. Because fault data are scarce, unbalanced data sets remain a difficult problem for fault diagnosis. Under unbalanced data sets, faults with little historical data are difficult to diagnose and can lead to economic losses. In order to improve prediction accuracy under unbalanced data sets, this paper proposes MeanRadius-SMOTE, based on the traditional SMOTE oversampling algorithm, which effectively avoids the generation of useless and noise samples. This paper validates the effectiveness of the algorithm on three linear unbalanced data sets and four step unbalanced data sets. Experimental results show that MeanRadius-SMOTE outperforms SMOTE and LR-SMOTE in various evaluation indicators and is more robust against different imbalance rates. In addition, MeanRadius-SMOTE balances the prediction accuracy of the overall and minority classes, which is of great significance for engineering applications.

1. Introduction

With the continuous innovation of technology, industrial equipment such as aircraft engines, steam turbines, wind turbines, and centrifuges has developed rapidly toward large-scale, automated, integrated, and intelligent systems. In order to meet the reliability and precision requirements of mechanical equipment in the industrial field, PHM (Prognostics and Health Management) was introduced to ensure the stable operation of mechanical equipment and reduce maintenance costs [1,2,3].
With the development of big data in the industrial field, data-driven mechanical fault diagnosis research has received more and more attention [4,5,6]. Mechanical fault diagnosis generally starts by extracting vibration signals from the operating equipment, because vibration signals provide sufficient fault features to reflect the fault status and can serve as the input of the prediction model [7,8]. However, because some faults occur infrequently, too few vibration samples of such faults are available and the classifier cannot predict them accurately; this is the problem of unbalanced data sets in fault diagnosis. In multi-classification mechanical fault diagnosis, a machine learning classifier emphasizes overall prediction accuracy, which sacrifices the prediction accuracy of the minority class to ensure the prediction of the majority class samples [9]. However, some mechanical equipment suffers infrequent failures that lead to huge economic losses once they occur. Therefore, it is necessary to research the problem of unbalanced data sets in mechanical fault diagnosis.
At present, research on unbalanced data sets in general is relatively mature, but research on this problem in the mechanical fault diagnosis field has just begun [10]. Many fault diagnosis techniques rely on reliable and complete data sets, such as multi-sensing fusion techniques [11]. However, since machinery usually operates under normal conditions, it is difficult to collect enough failure data, so actual data sets lack completeness [12,13]. The lack of samples with specific labels leads to data imbalance problems. In recent years, many scholars have begun to pay attention to this problem and have proposed their own methods [14,15]. Generally, solutions to the problem of unbalanced data sets are divided into data-level and algorithm-level approaches, and sometimes the two are combined [16].
At the data level, scholars mainly use resampling technology to copy, synthesize, delete, and otherwise operate on original samples, adjusting the number of samples to reduce the impact of unbalanced data sets. Resampling techniques are divided into oversampling of minority class samples and undersampling of majority class samples. The main idea of oversampling is to increase the number of minority class samples to achieve class balance, either by replicating samples or by generating new ones. ROS (Random Oversampling) randomly replicates original samples to expand the number of minority class samples, but it may replicate noise samples and thus degrade the quality of the data set [17]. Generation methods derive new samples from one or more original samples, and the new samples indirectly reflect the features of the minority class. The most classic oversampling method is the SMOTE algorithm [18]. SMOTE takes the line segment connecting two original samples as the range of the new sample and places the new sample at a point on that segment. However, SMOTE still does not avoid generating noise samples, and the new samples are easily affected by the distribution of the original samples, which may cause them to deviate from the actual distribution. Later scholars improved SMOTE in terms of noise reduction and generation rules, producing Borderline-SMOTE [19], ADASYN [20], LR-SMOTE [21], etc. Undersampling achieves class balance by reducing the number of majority class samples, for example by clustering-based undersampling or ENN (Edited Nearest Neighbor) [22]. In fact, most unbalanced data sets are caused by too few samples in the minority class, so oversampling is the main research focus in this field [23].
At the algorithm level, with the rapid development of machine learning, many classifier-level remedies for unbalanced data sets have been proposed. When all samples are weighted equally, the class sizes determine which class the classifier prefers, so setting sample weights, the decision boundary threshold, or the objective function of the classifier can strengthen its ability to handle unbalanced data sets [24,25]. Such adjustments make the classifier's decision boundary less sensitive to sample size [26]. Moreover, adding a proper regularization term to the objective function can reduce the impact of the imbalance rate on the classifier [27].
Although scholars have tried various directions, there is still no universal solution to the problem of unbalanced data sets in mechanical fault diagnosis. From the perspective of features, extracting richer features from vibration signals helps, because failures are reflected in the vibration energy of the equipment [28]. In addition to time- and frequency-domain features, there are features based on wavelet packet energy and entropy values [29,30], and fault features have also been extracted from infrared thermography images with a bag-of-visual-words approach [31]. However, more features undoubtedly increase the workload of feature screening. From the perspective of resampling, scholars have applied various existing resampling methods to mechanical equipment [32]. Once there are more failure types or concurrent failures, existing oversampling algorithms may fail. Therefore, analyzing the commonality of mechanical faults and proposing a new oversampling algorithm is the key to solving this problem in the mechanical field [33,34]. From the perspective of the classifier, scholars mainly set a cost matrix or change the loss function or network structure to make the classifier aware of the imbalance [35]. These classifiers are often only suitable for identifying faults in specific parts, such as gears or bearings [36].
Although new oversampling algorithms keep emerging, the following problems remain: (1) The solutions generally target only bearing or gear failures, so they cannot comprehensively diagnose the running state of complete mechanical equipment. (2) Most solutions are aimed at binary classification, which is clearly impractical: even a simple two-stage planetary gearbox already has as many as eight failure types. (3) The new samples that existing oversampling methods generate are not effective enough; although their number reaches a balance, the samples carry far too little fault-type information.
In view of these problems, this paper improves SMOTE and proposes an oversampling algorithm called MeanRadius-SMOTE, specially designed for multi-classification problems in mechanical fault diagnosis. MeanRadius-SMOTE reduces the production of noise samples, adds more samples capable of influencing the decision boundary, and more easily inherits the feature information of the original samples. The complexity of the MeanRadius-SMOTE algorithm is not high compared to SMOTE.
The main contributions of this paper are as follows: To solve the problem of multi-classification unbalanced data sets in mechanical fault diagnosis, a new oversampling algorithm, MeanRadius-SMOTE, is proposed. The algorithm balances overall and minority-class prediction performance, and prediction accuracy on the minority class in particular is greatly improved. A large number of comparative experiments on data sets with various specifications and imbalance rates verify the effectiveness, stability, and robustness of the algorithm.
The rest of this paper is divided into five parts. Section 2 introduces the SMOTE algorithm and the improved LR-SMOTE algorithm. Section 3 describes the MeanRadius-SMOTE algorithm in detail. Section 4 introduces the source and processing of the data set, as well as the classifiers and evaluation indicators used in the experiments. Section 5 presents the experimental process and results. In the final section, we discuss and summarize the MeanRadius-SMOTE algorithm based on the experiments and propose future research directions.

2. Related Works

Since machine learning algorithms are greedy in multi-classification problems, the classifier gives priority to the highest overall accuracy, resulting in inaccurate predictions for some minority class samples. In the real industrial field, for faults with low probability but high maintenance cost, operators hope the model can predict these faults accurately. Therefore, this section introduces two commonly used methods for dealing with unbalanced data sets: the traditional SMOTE method and the improved LR-SMOTE method.

2.1. SMOTE

The SMOTE algorithm was proposed by Chawla et al. in 2002 [18] as an improvement on ROS. In SMOTE, new samples are generated from the original samples, which is more likely to yield effective features than randomly replicating samples. The steps of the SMOTE algorithm are as follows:
(1)
For each minority class sample xi in the training set, calculate its Euclidean distance to every other minority class sample and obtain its k nearest neighbors.
(2)
According to the sample imbalance rate, set the sampling ratio N. For xi, randomly select N samples from its k nearest neighbors, denoted as xh.
(3)
According to Equation (1), build new samples based on xi and xh until the classes are balanced, denoted as xnew.
$x_{\mathrm{new}} = x_i + \mathrm{rand}(0,1)\,(x_h - x_i)$  (1)
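As a concrete illustration of these steps, the following is a minimal NumPy sketch of the SMOTE generation rule; the function name, array shapes, and default k are our own choices, not the original implementation.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from minority samples X_min (m, d)
    using the SMOTE rule of Equation (1)."""
    rng = np.random.default_rng(seed)
    # Step 1: pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)           # a sample is not its own neighbor
    knn = np.argsort(dist, axis=1)[:, :k]    # k nearest neighbors per sample

    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # step 2: pick x_i ...
        h = rng.choice(knn[i])               # ... and a random neighbor x_h
        # Step 3, Equation (1): x_new = x_i + rand(0, 1) * (x_h - x_i)
        new.append(X_min[i] + rng.random() * (X_min[h] - X_min[i]))
    return np.asarray(new)
```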
Although the SMOTE algorithm overcomes the overfitting problem of the ROS algorithm, it still produces noise samples and useless samples. Many scholars have improved SMOTE. For example, Han proposed the Borderline-SMOTE algorithm [19], which first classifies the original samples as safe, dangerous, or noise, then uses only the dangerous samples to generate new samples. This not only reduces the interference of noise points but also lets new samples better reflect the features of the data set. However, accurately assigning these three labels is itself a difficult problem on different data sets.

2.2. LR-SMOTE

Based on the SMOTE algorithm, Liang et al. proposed the LR-SMOTE algorithm [21]. The algorithm first uses SVM (Support Vector Machine) and K-means to remove noise samples from the original data set, then changes the generation rule so that new samples are generated with respect to the sample center. The specific steps of the LR-SMOTE algorithm are as follows:
(1)
Use SVM to classify the data set, and then for the wrongly classified minority samples use the K-means method to judge and remove the noise samples.
(2)
Use K-means to find the center xc of the minority class sample, calculate the distance di from each minority class sample to the center xc, and calculate the average distance dm.
(3)
For each minority class sample xi, calculate the ratio Mi of the average distance dm and the distance di.
(4)
According to the number of same-class samples among its neighbors, set the weight of each minority class sample; then randomly select a minority class sample xi and build a new sample xnew according to Equation (2).
$x_{\mathrm{new}} = x_i + \mathrm{rand}(0, M_i)\,(x_c - x_i)$  (2)
(5)
Repeat steps 3 and 4 until the number of samples of the majority class and minority class is balanced.
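A minimal sketch of the LR-SMOTE generation rule follows; for brevity it omits the SVM/K-means noise-removal step (1) and the neighbor-based weighting of step (4), drawing samples uniformly instead, so it is an approximation rather than the authors' implementation.

```python
import numpy as np

def lr_smote_generate(X_min, n_new, seed=0):
    """Generate n_new samples toward the minority center per Equation (2)."""
    rng = np.random.default_rng(seed)
    x_c = X_min.mean(axis=0)                  # step 2: sample center
    d = np.linalg.norm(X_min - x_c, axis=1)   # distances d_i to the center
    M = d.mean() / d                          # step 3: ratios M_i = d_m / d_i

    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))          # uniform draw (weighting omitted)
        # Equation (2): x_new = x_i + rand(0, M_i) * (x_c - x_i); samples move
        # toward the center and may overshoot it when d_i < d_m (M_i > 1).
        new.append(X_min[i] + rng.uniform(0, M[i]) * (x_c - X_min[i]))
    return np.asarray(new)
```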
In the LR-SMOTE algorithm, new samples are generated from the relationship between the sample center and each sample, rather than between two arbitrary minority samples. Therefore, the new samples do not deviate from the range of the minority samples, and their features are closer to the original samples. LR-SMOTE points to a good direction for generation rules: let the algorithm determine the distribution of new samples according to the sample center. This paper proposes a new algorithm along these lines to address unbalanced data sets in the mechanical field. We apply the MeanRadius-SMOTE algorithm to a variety of mechanical failure data sets, and the experimental results show that it is well suited to this problem.

3. Proposed Method

In an oversampling algorithm, new samples at different geometric locations contribute differently to classifier training. In general, the more new samples there are near the decision boundary, the greater the impact on the classifier. This paper proposes the MeanRadius-SMOTE (MR-SMOTE) algorithm, which considers the sample center and radius. When using machine learning to predict mechanical failures, we handle noise samples in advance, so noise reduction is performed during feature preprocessing; it is not part of MeanRadius-SMOTE itself, and the noise reduction procedure is introduced in the next section.
The MeanRadius-SMOTE algorithm mainly changes the generation rules of the SMOTE algorithm, so that the new samples are more likely to be distributed near the average radius of the minority class samples, and the new samples have a stronger ability to affect the decision boundary of the classifier. In the MeanRadius-SMOTE algorithm, the new sample is determined by k vectors of the sample center to the samples, and the distance between the new sample and the sample center follows a normal distribution. The steps of the MeanRadius-SMOTE algorithm are as follows:
(1)
According to each minority class sample, calculate the geometric center, denoted as the sample center xc of the minority class sample.
(2)
Calculate the Euclidean distance from each minority class sample to the sample center, and then obtain the average distance, denoted as the sample radius dm of the minority class.
(3)
Randomly select k minority class samples, and then obtain k vectors vi from the sample center xc to the samples. Compute the resultant vector of k vectors.
(4)
Use a normal distribution with mean dm and variance dm/θ to determine the distance between the new sample and the sample center. Build new samples according to Equation (3).
$x_{\mathrm{new}} = x_c + r \sum_{i=1}^{k} v_i, \qquad r \sim N\!\left(d_m, \frac{d_m}{\theta}\right)$  (3)
(5)
Repeat steps 3 and 4 until the number of samples of the majority class and minority class is balanced.
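The steps above translate directly into code. The sketch below is our reading of the algorithm: in particular, we normalize the resultant vector to unit length so that r in Equation (3) is the actual distance between the new sample and the center, as steps (3) and (4) describe; the function name and defaults are our own.

```python
import numpy as np

def mean_radius_smote(X_min, n_new, k=3, theta=6, seed=0):
    """Generate n_new synthetic minority samples per Equation (3)."""
    rng = np.random.default_rng(seed)
    x_c = X_min.mean(axis=0)                          # step 1: sample center
    d_m = np.linalg.norm(X_min - x_c, axis=1).mean()  # step 2: sample radius

    new = []
    while len(new) < n_new:
        # Step 3: k random minority samples and their resultant vector.
        idx = rng.choice(len(X_min), size=k, replace=False)
        v = (X_min[idx] - x_c).sum(axis=0)
        norm = np.linalg.norm(v)
        if norm == 0:                                 # degenerate direction
            continue
        # Step 4: r ~ N(d_m, d_m / theta); NumPy's scale argument is a
        # standard deviation, so we pass the square root of the variance.
        r = rng.normal(loc=d_m, scale=np.sqrt(d_m / theta))
        new.append(x_c + r * v / norm)                # Equation (3)
    return np.asarray(new)
```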
In order to show the flow of the algorithm more conveniently, we draw the flow chart of the MeanRadius-SMOTE algorithm, as shown in Figure 1.
In the MeanRadius-SMOTE algorithm, k and θ are hyperparameters, determined according to the number of minority class samples and the imbalance rate. If k is too large, the direction of the new sample relative to the sample center becomes meaningless, while θ directly controls the spread of the new samples. As shown in Figure 2, new samples under different θ are likely to be distributed in the colored areas. When θ is too small, the variance dm/θ is large and the new sample may fall far from the sample center; when θ is too large, generation is too conservative and cannot balance the number of positive and negative samples near the decision boundary. Therefore, in general, k is chosen from 2 to 5 and θ from 4 to 10.
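With the sketch above, these recommended ranges translate into a call such as the following (the sample counts here are illustrative):

```python
# Balance a 50-sample minority class up to 1500 samples, with k and theta
# chosen inside the recommended ranges (k in 2-5, theta in 4-10).
X_syn = mean_radius_smote(X_min, n_new=1450, k=3, theta=6)
X_balanced = np.vstack([X_min, X_syn])
```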
For mechanical equipment, some concurrent faults have vibration states similar to the original fault, and the two types of samples often overlap in distribution. Whether the classifier can find an excellent decision boundary determines the accuracy of the model. In MeanRadius-SMOTE, most new samples are concentrated around the sample radius to ensure their validity. Each new sample is determined by k samples and related to the sample center, so it better inherits the features of the minority class. The geometric positions of new samples generated by different oversampling algorithms have their own characteristics, so we plot examples of SMOTE, LR-SMOTE, and MeanRadius-SMOTE on two-dimensional feature samples, as shown in Figure 3. The information of the two-dimensional feature samples is shown in Table 1.
The new samples of SMOTE tend to be generated in locations where the original samples are dense. Since LR-SMOTE randomly chooses a sample to determine the orientation of the new sample, its new samples are more clustered and radial. In MeanRadius-SMOTE, the orientation of new samples is relatively random, and the new samples are generated around the sample radius.

4. Experimental Preparation

4.1. Data Set

Our experimental data set is the 2009 PHM data challenge gearbox data set [37]. It is a typical industrial gearbox data set containing 3 shafts, 4 gears, and 6 bearings; its test bench is shown in Figure 4. The data set covers two gear sets: spur gears and helical gears. The spur gear data set contains 8 health states, and the helical gear data set contains 6 health states. The data consist of two channels of accelerometer signals and one channel of tachometer signals. The sampling frequency is 66.67 kHz, and the tachometer signals are collected at 10 pulses per revolution. There are five shaft speeds (30 Hz, 35 Hz, 40 Hz, 45 Hz, and 50 Hz) under high and low loads. In the experiment, we chose the low-load spur gear data at 30 Hz and used the vibration data of the two acceleration channels for feature extraction. The 8 health states of the spur gears are listed in Table 2.
Mechanical equipment in harsh high-temperature, high-pressure environments frequently fails through concurrent failures composed of multiple single failures [38]. In the PHM data set, there are many types of concurrent failures, such as labels 4 to 8, which are combinations of different gear and bearing failures.
For the vibration signal, we sampled the data set using a sliding window with a stride of 100 and a width of 1000. We then extracted time–frequency domain features for each vibration signal sample and added labels [39]. The formulas of the 23 features are shown in Table 3.
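As an illustration of this preprocessing, the sketch below slices one vibration channel with the stated window parameters and computes a few of the 23 features of Table 3 (only F1, F2, F4, and F5 are shown; the helper is our own):

```python
import numpy as np

def window_features(signal, width=1000, stride=100):
    """Slide a window over one vibration channel and extract example
    time-domain features (F1 mean, F2 std, F4 RMS, F5 peak)."""
    feats = []
    for start in range(0, len(signal) - width + 1, stride):
        x = signal[start:start + width]
        feats.append([
            x.mean(),                      # F1: mean value
            x.std(ddof=1),                 # F2: standard deviation
            np.sqrt(np.mean(x ** 2)),      # F4: root mean square
            np.abs(x).max(),               # F5: peak value
        ])
    return np.asarray(feats)
```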
In the experiment, we used the K-nearest neighbor algorithm to denoise the data set: if none of the five nearest samples around a sample belong to its class, we consider it a noise sample and delete it. After this preprocessing, we obtained 2656 samples per label, of which 1000 samples per label were taken as the test set. The remaining samples were used to construct the unbalanced data sets.
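The denoising rule can be sketched with scikit-learn's NearestNeighbors; the helper name and the keep/drop logic below are our rendering of the rule just described:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_denoise(X, y, k=5):
    """Drop samples whose k nearest neighbors all carry a different label."""
    # k + 1 neighbors because each sample's nearest neighbor is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    keep = np.array([(y[idx[i, 1:]] == y[i]).any() for i in range(len(X))])
    return X[keep], y[keep]
```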

4.2. Classifiers

In order to evaluate the oversampling algorithms comprehensively, we chose different classifiers to build the experimental models, which excludes the influence of any single classifier and verifies the generality of the oversampling algorithm. In a large number of mechanical fault diagnosis experiments, the SVM classifier generally trains well, so we chose SVM to establish a classification model. With the continuous development of decision tree algorithms, ensemble learning models are also favored by scholars for their excellent generalization ability. Therefore, we chose RF (Random Forest), representing the bagging ensemble mode, and GBDT (Gradient Boosting Decision Tree), representing the boosting ensemble mode.
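In scikit-learn terms, the three models can be instantiated as below; the paper does not report its hyperparameter settings, so library defaults are shown as placeholders:

```python
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# SVM, bagging-style RF, and boosting-style GBDT with default settings.
classifiers = {
    "SVM": SVC(),
    "RF": RandomForestClassifier(),
    "GBDT": GradientBoostingClassifier(),
}
```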

4.3. Evaluation Indicators

Traditional evaluation indicators can evaluate model performance well in binary classification problems. However, in multi-classification problems, due to the classifier's bias toward the majority class, these indicators cannot comprehensively evaluate a model on unbalanced data sets. The expectation for the oversampling algorithm in this paper is to improve prediction performance on the minority class without losing the overall prediction accuracy of the classifier. Therefore, we use both traditional evaluation indicators and a minority-class prediction indicator to evaluate the models. For class i samples, we define the prediction results as shown in Table 4.
We choose the following four evaluation indicators:
(1)
Accuracy (Acc): The Acc value is the ratio of the number of correctly predicted samples to the total number of samples, calculated as shown in Equation (4):
$\mathrm{Acc} = \frac{\sum_{i=1}^{n} \mathrm{TP}_i}{\sum_{i=1}^{n} \left(\mathrm{TP}_i + \mathrm{FN}_i\right)}$  (4)
The Acc value evaluates the overall prediction, but in the case of unbalanced data sets, it is not a good indicator to measure the results.
(2)
Macro-Precision (Mac-P): The calculation method of the Precision value for class i samples is as shown in Equation (5):
$\mathrm{Precision}_i = \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FP}_i}$  (5)
In multi-classification problems, the Precision value has Macro and Micro variants. Micro-Precision focuses more on classes with many samples, so it is more susceptible to the majority class, whereas Mac-P treats every class equally and thus better describes the model's ability to deal with unbalanced data sets. It is calculated as shown in Equation (6):
$\text{Mac-P} = \frac{\sum_{i=1}^{n} \mathrm{Precision}_i}{n}$  (6)
(3)
Macro-F1 (Mac-F1): Precision and Recall are often in tension; the F1 value is a balance point between high Precision and high Recall, calculated as shown in Equation (7):
$\mathrm{F1}_i = \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}$  (7)
In multi-classification problems, the F1 value also has Macro and Micro variants, like the Precision value. This paper selects Mac-F1, which better accounts for the minority class. It is calculated as shown in Equation (8):
$\text{Mac-F1} = \frac{\sum_{i=1}^{n} \mathrm{F1}_i}{n}$  (8)
(4)
Precision-Minority (Presmall): To focus on the model's prediction of the minority class samples after oversampling, we calculate the Precision value of the smallest minority class as an indicator, as shown in Equation (9):
$\mathrm{Pre}_{\mathrm{small}} = \frac{\mathrm{TP}_{\mathrm{small}}}{\mathrm{FP}_{\mathrm{small}} + \mathrm{TP}_{\mathrm{small}}}$  (9)
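The four indicators map onto standard scikit-learn metrics as in the sketch below; the function name and the explicit label handling are our own, and Pre_small is simply the per-class precision of the designated smallest minority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

def evaluate(y_true, y_pred, minority_label):
    """Compute Acc, Mac-P, Mac-F1 (Equations (4)-(8)) and Pre_small (9)."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    per_class_p = precision_score(y_true, y_pred, labels=labels,
                                  average=None, zero_division=0)
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "Mac-P": precision_score(y_true, y_pred, average="macro",
                                 zero_division=0),
        "Mac-F1": f1_score(y_true, y_pred, average="macro"),
        "Pre_small": per_class_p[np.flatnonzero(labels == minority_label)[0]],
    }
```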

5. Experimental Design and Results

In this paper, we design unbalanced data sets of various sizes for the experiments. According to the distribution of sample counts across classes, unbalanced data sets can be divided into two forms: linear imbalance and step imbalance. The sample count distributions of the two forms are shown in Figure 5.
We design three linear unbalanced data sets and four step unbalanced data sets. In order to reduce interference with Presmall across experiments, we fix label 4 at 50 samples, making it the smallest minority class in every data set. We set the normal label as the largest class, with imbalance rates of 30, 20, and 15, from which the sample counts of the other labels are determined. The details of the seven unbalanced data sets are shown in Table 5. Data sets line-1 to line-3 differ in imbalance rate, and the linear ordering of the labels is shuffled. Data sets stage-1 to stage-4 differ in imbalance rate and in the ratio of minority to majority class labels.
In the experiment, we use SMOTE, LR-SMOTE, and MeanRadius-SMOTE to oversample the seven unbalanced data sets until every class label is balanced. We then run experiments on the original data set and the three processed data sets with the SVM, RF, and GBDT classifiers. In order to eliminate training bias caused by random data, all experiments were performed with 5-fold cross-validation and repeated 10 times to obtain averaged indicator values.
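One reading of this protocol, reusing the evaluate helper sketched in Section 4.3, is the loop below; note that oversampling is applied only to the training folds, so the evaluation folds stay untouched (the factory/callback structure is our own):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def run_experiment(X, y, oversample, clf_factory, repeats=10, seed=0):
    """Average the four indicators over 10 repeats of 5-fold CV."""
    scores = []
    for r in range(repeats):
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + r)
        for train, test in skf.split(X, y):
            # Balance the training folds only, e.g. with mean_radius_smote.
            X_tr, y_tr = oversample(X[train], y[train])
            clf = clf_factory().fit(X_tr, y_tr)
            preds = clf.predict(X[test])
            scores.append(evaluate(y[test], preds, minority_label=4))
    return {m: np.mean([s[m] for s in scores]) for m in scores[0]}
```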
The experimental results of Acc, Mac-P, and Mac-F1 on the linear and step unbalanced data sets are shown in Table 6 and Table 7, where bold values mark the largest value among the four compared models.
From Table 6, it can be seen that the oversampling algorithms effectively improve Acc, Mac-P, and Mac-F1, and MeanRadius-SMOTE is the best in most cases. In some experiments, SMOTE performs better than MeanRadius-SMOTE, but the gap between them is very small. In the SVM experiments, however, MeanRadius-SMOTE improves the three indicators much more than SMOTE and LR-SMOTE.
From Table 7, since the step unbalanced data sets contain more minority classes, all three indicators are lower in the experiments without oversampling and are more affected by the imbalance rate. The SVM classifier combined with any oversampling algorithm outperforms the ensemble learning classifiers, with obvious gaps in all three indicators. On the step unbalanced data sets, MeanRadius-SMOTE outperforms SMOTE and LR-SMOTE in all cases, and the gap is especially significant on the SVM classifier.
Judging by Acc, Mac-P, and Mac-F1, all oversampling algorithms effectively improve the overall prediction performance of the classifiers on both forms of unbalanced data sets, and the MeanRadius-SMOTE algorithm proposed in this paper has the most obvious effect. We still need to examine prediction performance on the minority class; the experimental results of Presmall are shown in Table 8, where bold values mark the largest value among the four compared models.
From Table 8, Presmall does not even exceed 0.5 in the None experiments. SMOTE and LR-SMOTE only raise Presmall to around 0.5 in most experiments, whereas MeanRadius-SMOTE helps the classifier predict the minority class more accurately, raising Presmall to around 0.6 or 0.7. In addition, MeanRadius-SMOTE is more stable across different imbalance rates and does not fluctuate greatly like SMOTE and LR-SMOTE.
To better compare the effects of SMOTE, LR-SMOTE, and MeanRadius-SMOTE, we draw line charts of Mac-P, Mac-F1, and Presmall, as shown in Figure 6. Since Acc and Mac-F1 are close in value and follow essentially the same trend, we only plot Mac-F1.
According to Figure 6, the following conclusions can be drawn:
(1)
Since these seven unbalanced data sets come from the same source, the better the oversampling algorithm, the more consistent the indicators should be. Comparing the nine charts, all indicators are relatively stable in the MeanRadius-SMOTE experiments, which are less affected by the imbalance rate and data set form, and this stability is most obvious on the SVM classifier. This shows that MeanRadius-SMOTE has good robustness.
(2)
Analyzing Figure 6a,d,g, across the seven data sets, MeanRadius-SMOTE on the SVM classifier not only keeps the overall prediction indicators at about 0.9 but also keeps Presmall relatively high, at about 0.75.
(3)
Comparing Figure 6g–i, the SVM experiments achieve a higher Presmall, and in most experiments Presmall is strongly affected by the data set, especially for RF. Only the model composed of MeanRadius-SMOTE and SVM yields a nearly flat line, which shows good robustness and accuracy in predicting the minority class.
(4)
Comparing Figure 6a–c, LR-SMOTE performs better than SMOTE on SVM, while the opposite holds on RF and GBDT. In addition, SMOTE even outperforms MeanRadius-SMOTE in some GBDT experiments. LR-SMOTE was designed for binary classification problems, which makes it more suitable for SVM, a classifier that is essentially a binary classification algorithm. It can therefore be inferred that MeanRadius-SMOTE is likewise better suited to SVM classifiers.
In summary, MeanRadius-SMOTE shows excellent performance in all experiments, balancing overall and minority-class prediction. In individual experiments, SMOTE is slightly higher than MeanRadius-SMOTE in Acc, Mac-P, and Mac-F1, but lower in Presmall; we regard this as the result of sacrificing minority-class prediction. Therefore, MeanRadius-SMOTE can still be considered better than SMOTE and LR-SMOTE. Furthermore, the model composed of MeanRadius-SMOTE and SVM improves both prediction accuracy and stability.

6. Conclusions and Outlook

Mechanical fault diagnosis has always been a key issue in PHM. Although many effective machine learning methods have been applied to mechanical fault diagnosis, fault diagnosis under unbalanced data sets remains a stubborn problem. Oversampling is currently recognized as an effective means to address unbalanced data sets, but the traditional oversampling algorithm is not only affected by the sample distribution but also easily generates noise samples, which blurs the decision boundary. These drawbacks hinder the classifier's predictions.
Based on SMOTE, this paper proposes a new algorithm, MeanRadius-SMOTE, which combines the sample center and radius. MeanRadius-SMOTE effectively avoids useless samples and noise samples when generating new samples. We conduct comparative experiments with the SMOTE, LR-SMOTE, and MeanRadius-SMOTE algorithms using SVM, RF, and GBDT classifiers on three linear unbalanced data sets and four step unbalanced data sets. Experimental results show that the MeanRadius-SMOTE algorithm can effectively balance data classes and improve the prediction performance of machine learning classifiers. Across all indicators, MeanRadius-SMOTE is better than SMOTE and LR-SMOTE and is more robust. On unbalanced data sets, MeanRadius-SMOTE predicts the minority class more accurately without sacrificing the prediction performance of the other classes, which is of great significance for mechanical fault diagnosis, and the combined model of MeanRadius-SMOTE and SVM proves much better than the other models.
Although this paper shows on the PHM09 challenge data that MeanRadius-SMOTE handles unbalanced data sets well, considering practical situations, future research can proceed in the following directions:
(1)
In this paper, in order to run experiments under a variety of unbalanced data sets, we used artificially constructed unbalanced data sets. In future research, we will collect unbalanced failure data from actual mechanical equipment to continue the verification experiments.
(2)
When constructing the data set in this paper, we only extracted time–frequency domain features from the vibration signal. There are many other methods to extract features from vibration signals, such as convolutional neural networks and wavelet packet decomposition. Training sets composed of different types of features may affect the performance of MeanRadius-SMOTE.

Author Contributions

Data curation, F.D. and S.Z.; validation, F.D.; investigation, S.Z. and Y.Y.; writing—original draft preparation, F.D. and Y.Y.; writing—review and editing, F.D., S.Z., and Z.C.; supervision, S.Z. and Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This article was supported by the National Natural Science Foundation of China (71871181); the Basic Research Project of Natural Science of Shaanxi Province (2022JM-433); the Key R&D Program of Shaanxi Province (2022GY-207, 2021ZDLGY10-06).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this article are from the 2009 PHM data challenge on gearboxes.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, Y.; Wang, S.; Deng, Z. Intelligent fault identification of rotary machinery using refined composite multi-scale Lempel–Ziv complexity. J. Manuf. Syst. 2021, 61, 725–735. [Google Scholar]
  2. Jiang, W.; Xie, C.; Zhuang, M.; Shou, Y.; Tang, Y. Sensor data fusion with Z-numbers and its application in fault diagnosis. Sensors 2016, 16, 1509. [Google Scholar] [CrossRef] [PubMed]
  3. Liu, R.; Yang, B.; Zio, E.; Chen, X. Artificial intelligence for fault diagnosis of rotating machinery: A review. Mech. Syst. Signal. Process. 2018, 108, 33–47. [Google Scholar] [CrossRef]
  4. Lu, S.; He, Q.; Yuan, T.; Kong, F. Online fault diagnosis of motor bearing via stochastic-resonance-based adaptive filter in an embedded system. IEEE Trans. Syst. Man Cybern. Syst. 2018, 47, 1111–1122. [Google Scholar] [CrossRef]
  5. Cai, B.; Zhao, Y.; Liu, H.; Xie, M. A data-driven fault diagnosis methodology in Three-Phase inverters for PMSM drive systems. IEEE Trans. Power Electron. 2017, 32, 5590–5600. [Google Scholar] [CrossRef]
  6. Cofre-Martel, S.; Droguett, E.; Modarres, M. Big Machinery Data Preprocessing Methodology for Data-Driven Models in Prognostics and Health Management. Sensors 2021, 21, 6841. [Google Scholar] [CrossRef]
  7. Akilu, Y.; Sinha, J.; Nembhard, A. A novel fault diagnosis technique for enhancing maintenance and reliability of rotating machines. Struct. Health Monit. 2015, 14, 604–621. [Google Scholar]
  8. Li, Y.; Wang, X.; Si, S.; Huang, S. Entropy Based Fault Classification Using the Case Western Reserve University Data: A Benchmark Study. IEEE Trans. Reliab. 2020, 69, 754–767. [Google Scholar]
  9. Fernandez, A.; Garcia, S.; Luengo, J.; Bernado-Mansilla, E.; Herrera, F. Genetics-based machine learning for rule induction: State of the art, Taxonomy, and Comparative Study. IEEE Trans. Evol. Comput. 2010, 14, 913–941. [Google Scholar] [CrossRef]
  10. Swana, E.; Doorsamy, W.; Bokoro, P. Tomek Link and SMOTE Approaches for Machine Fault Classification with an Imbalanced Dataset. Sensors 2022, 22, 3246. [Google Scholar] [CrossRef]
  11. He, Z.; Shao, H.; Jing, L.; Cheng, J.; Yang, Y. Transfer fault diagnosis of bearing installed in different machines using enhanced deep auto-encoder. Measurement 2020, 152, 107393. [Google Scholar]
  12. Zhang, T.; Chen, J.; Li, F.; Zhang, K.; Lv, H.; He, S.; Xu, E. Intelligent fault diagnosis of machines with small & imbalanced data: A state-of-the-art review and possible extensions. ISA Trans. 2022, 119, 152–171. [Google Scholar] [PubMed]
  13. Shao, H.; Lin, J.; Zhang, L.; Galar, D.; Kumar, U. A novel approach of multisensory fusion to collaborative fault diagnosis in maintenance. Inf. Fusion 2021, 74, 65–76. [Google Scholar] [CrossRef]
  14. Martin-Diaz, I.; Morinigo-Sotelo, D.; Duque-Perez, O.; De, R. Early fault detection in induction motors using adaboost with imbalanced small data and optimized sampling. IEEE Trans. Ind. Electron. 2022, 53, 3066–3075. [Google Scholar] [CrossRef]
  15. Zhang, T.; Chen, J.; Li, F.; Pan, T.; He, S. A small sample focused intelligent fault diagnosis scheme of machines via multi-modules learning with gradient penalized generative adversarial networks. IEEE Trans. Ind. Electron. 2021, 68, 10130–10141. [Google Scholar] [CrossRef]
  16. Branco, P.; Torgo, L.; Ribeiro, R. A survey of predictive modeling on imbalanced domains. ACM Comput. Surv. 2017, 45, 351–400. [Google Scholar] [CrossRef]
  17. Haixiang, G.; Yijing, L.; Shang, J.; Mingyun, G.; Yuanyue, H.; Bing, G. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 2017, 73, 220–239. [Google Scholar] [CrossRef]
  18. Chawla, N.; Hall, L.; Bowyer, K.; Kegelmeyer, W. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  19. Han, H.; Wang, W.; Mao, B. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing (ICIC), Hefei, China, 23–26 August 2005. [Google Scholar]
  20. Haibo, H.; Yang, B.; Garcia, E.; Shutao, L. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008. [Google Scholar]
  21. Liang, X.; Jiang, A.; Li, T.; Xue, Y.; Wang, G. LR-SMOTE—An improved unbalanced data set oversampling based on K-means and SVM. Knowl.-Based Syst. 2020, 196, 105846. [Google Scholar] [CrossRef]
  22. Yen, S.; Lee, Y. Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. In Proceedings of the International Conference on Intelligent Computing, Kunming, China, 16–19 August 2006. [Google Scholar]
  23. Laurikkala, J. Improving identification of difficult small classes by balancing class distribution. In Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe, Cascais, Portugal, 1–4 July 2001. [Google Scholar]
  24. Zhang, C.; Gao, W.; Song, J.; Jiang, J. An imbalanced data classification algorithm of improved autoencoder neural network. In Proceedings of the 8th International Conference on Advanced Computational Intelligence, Chiang Mai, Thailand, 14–16 February 2016. [Google Scholar]
  25. Cheng, F.; Zhang, J.; Wen, C. Cost-sensitive large margin distribution machine for classification of imbalanced data. Pattern Recognit. Lett. 2016, 80, 107–112. [Google Scholar] [CrossRef]
  26. Xu, Q.; Lu, S.; Jia, W.; Jiang, C. Imbalanced fault diagnosis of rotating machinery via multi-domain feature extraction and cost-sensitive learning. J. Manuf. Syst. 2020, 31, 1467–1481. [Google Scholar] [CrossRef]
  27. Zhao, X.; Jia, M.; Lin, M. Deep laplacian auto-encoder and its application into imbalanced fault diagnosis of rotating machinery. Measurement 2020, 31, 1467–1481. [Google Scholar] [CrossRef]
  28. Li, Y.; Du, X.; Wang, X.; Si, S. Industrial gearbox fault diagnosis based on multi-scale convolutional neural networks and thermal imaging. ISA Trans. 2022, in press. [Google Scholar]
  29. Li, Y.; Wang, S.; Yang, Y.; Deng, Z. Multiscale symbolic fuzzy entropy: An entropy denoising method for weak feature extraction of rotating machinery. Mech. Syst. Signal. Process. 2022, 162, 108052. [Google Scholar]
  30. Zhang, C.; Chen, J.; Guo, X. A gear fault diagnosis method based on EMD energy entropy and SVM. Shock Vib. 2010, 29, 216–220. [Google Scholar]
  31. Li, Y.; Wang, X.; Si, S.; Du, X. A New Intelligent Fault Diagnosis Method of Rotating Machinery under Varying-Speed Conditions Using Infrared Thermography. Complexity 2019, 2019, 2619252. [Google Scholar]
  32. Farajzadeh-Zanjani, M.; Razavi-Far, R.; Saif, M. Efficient sampling techniques for ensemble learning and diagnosing bearing defects under class imbalanced condition. In Proceedings of the IEEE Symposium Series on Computational Intelligence, Athens, Greece, 6–9 December 2016. [Google Scholar]
  33. Mao, W.; He, L.; Yan, Y.; Wang, J. Online sequential prediction of bearings imbalanced fault diagnosis by extreme learning machine. Mech. Syst. Signal. Process. 2017, 83, 450–473. [Google Scholar] [CrossRef]
  34. Razavi-Far, R.; Farajzadeh-Zanjani, M.; Saif, M. An integrated class-imbalanced learning scheme for diagnosing bearing defects in induction motors. IEEE Trans. Ind. Inform. 2017, 13, 2758–2769. [Google Scholar] [CrossRef]
  35. Jia, F.; Lei, Y.; Lu, N.; Xing, S. Deep normalized convolutional neural network for imbalanced fault classification of machinery and its understanding via visualization. Mech. Syst. Signal. Process. 2018, 110, 349–367. [Google Scholar] [CrossRef]
  36. Lan, Y.; Han, X.; Xiong, X.; Huang, J.; Zong, W.; Ding, X.; Ma, B. Two-step fault diagnosis framework for rolling element bearings with imbalanced data based on GSA-WELM and GSA-ELM. Proc. Inst. Mech. Eng. Part C J. Mech. Eng. Sci. 2018, 232, 2937–2947. [Google Scholar] [CrossRef]
  37. Jing, L.; Zhao, M.; Li, P.; Xu, X. A convolutional neural network based feature learning and fault diagnosis method for the condition monitoring of gearbox. Measurement 2017, 111, 1–10. [Google Scholar] [CrossRef]
  38. Zhang, K.; Zhoudong, H.; Yi, C. Review of multiple fault diagnosis methods. IET Control Theory Appl. 2015, 32, 1143–1157. [Google Scholar]
  39. Lei, Y.; He, Z.; Zi, Y.; Hu, Q. Fault diagnosis of rotating machinery based on multiple ANFIS combination with GAs. Mech. Syst. Signal. Process. 2007, 21, 2280–2294. [Google Scholar] [CrossRef]
Figure 1. The flow chart of the MeanRadius-SMOTE algorithm.
Figure 2. New samples distribution under different θ.
Figure 3. New samples on oversampling algorithms: (a) SMOTE, (b) LR-SMOTE, (c) MeanRadius-SMOTE.
Figure 4. Gearbox used in PHM 2009 challenge data.
Figure 5. Two imbalance forms: (a) linear imbalance, (b) step imbalance.
Figure 6. The line charts of Mac-P, Mac-F1, and Presmall.
Table 1. The information of the two-dimension feature samples.

| | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 | Sample 7 | Sample 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Feature 1 | 3 | 4 | 6 | 7 | 5 | 2 | 3 | 5.5 |
| Feature 2 | 6 | 3 | 2 | 4 | 5 | 2 | −1 | 0 |
Table 2. A brief description of the faults.

| Label | Description |
| --- | --- |
| Label 1 | Good |
| Label 2 | Gear chipped and eccentric |
| Label 3 | Gear eccentric |
| Label 4 | Gear eccentric and broken, bearing ball fault |
| Label 5 | Gear chipped, eccentric, and broken; bearing inner, ball, and outer fault |
| Label 6 | Gear broken; bearing inner, ball, and outer fault; shaft imbalance |
| Label 7 | Bearing inner fault, shaft keyway sheared |
| Label 8 | Bearing ball and outer fault, shaft imbalance |
Table 3. The time–frequency domain features.

| Time-Domain Feature | | Frequency-Domain Feature | |
| --- | --- | --- | --- |
| $F_1 = \frac{\sum_{n=1}^{N} x(n)}{N}$ | $F_7 = \frac{\sum_{n=1}^{N}(x(n)-F_1)^4}{(N-1)F_2^4}$ | $F_{12} = \frac{\sum_{k=1}^{K} s(k)}{K}$ | $F_{18} = \sqrt{\frac{\sum_{k=1}^{K} f_k^2 s(k)}{\sum_{k=1}^{K} s(k)}}$ |
| $F_2 = \sqrt{\frac{\sum_{n=1}^{N}(x(n)-F_1)^2}{N-1}}$ | $F_8 = \frac{F_5}{F_4}$ | $F_{13} = \frac{\sum_{k=1}^{K}(s(k)-F_{12})^2}{K-1}$ | $F_{19} = \sqrt{\frac{\sum_{k=1}^{K} f_k^4 s(k)}{\sum_{k=1}^{K} f_k^2 s(k)}}$ |
| $F_3 = \left(\frac{\sum_{n=1}^{N}\sqrt{|x(n)|}}{N}\right)^2$ | $F_9 = \frac{F_5}{F_3}$ | $F_{14} = \frac{\sum_{k=1}^{K}(s(k)-F_{12})^3}{K\left(\sqrt{F_{13}}\right)^3}$ | $F_{20} = \frac{\sum_{k=1}^{K} f_k^2 s(k)}{\sqrt{\sum_{k=1}^{K} s(k)\sum_{k=1}^{K} f_k^4 s(k)}}$ |
| $F_4 = \sqrt{\frac{\sum_{n=1}^{N}(x(n))^2}{N}}$ | $F_{10} = \frac{F_4}{\frac{1}{N}\sum_{n=1}^{N}|x(n)|}$ | $F_{15} = \frac{\sum_{k=1}^{K}(s(k)-F_{12})^4}{K F_{13}^2}$ | $F_{21} = \frac{F_{17}}{F_{16}}$ |
| $F_5 = \max|x(n)|$ | $F_{11} = \frac{F_5}{\frac{1}{N}\sum_{n=1}^{N}|x(n)|}$ | $F_{16} = \frac{\sum_{k=1}^{K} f_k s(k)}{\sum_{k=1}^{K} s(k)}$ | $F_{22} = \frac{\sum_{k=1}^{K}(s(k)-F_{16})^3 s(k)}{K F_{17}^3}$ |
| $F_6 = \frac{\sum_{n=1}^{N}(x(n)-F_1)^3}{(N-1)F_2^3}$ | | $F_{17} = \sqrt{\frac{\sum_{k=1}^{K}(f_k-F_{16})^2 s(k)}{K}}$ | $F_{23} = \frac{\sum_{k=1}^{K}(s(k)-F_{16})^{0.5} s(k)}{K\sqrt{F_{17}}}$ |

where x(n) is the signal series for n = 1, 2, …, N, and N is the number of data points; s(k) is the spectrum series for k = 1, 2, …, K, K is the number of spectrum lines, and fk is the frequency value of the kth spectrum line.
Table 4. Predicting results for class i samples.

| | Positive Prediction | Negative Prediction |
| --- | --- | --- |
| Positive class | TPi | FNi |
| Negative class | FPi | TNi |
Table 5. Unbalanced data sets description.

| Imbalance Form | Name | Label 1 | Label 2 | Label 3 | Label 4 | Label 5 | Label 6 | Label 7 | Label 8 | Imbalance Rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| linear | line-1 | 1500 | 465 | 258 | 50 | 672 | 879 | 1293 | 1086 | 30 |
| linear | line-2 | 1000 | 864 | 592 | 50 | 728 | 321 | 185 | 457 | 20 |
| linear | line-3 | 750 | 550 | 450 | 50 | 150 | 350 | 650 | 250 | 15 |
| step | stage-1 | 1500 | 50 | 1500 | 50 | 1500 | 1500 | 1500 | 50 | 30 |
| step | stage-2 | 750 | 50 | 750 | 50 | 750 | 750 | 750 | 50 | 15 |
| step | stage-3 | 1500 | 50 | 1500 | 50 | 50 | 50 | 50 | 1500 | 30 |
| step | stage-4 | 750 | 50 | 750 | 50 | 50 | 50 | 50 | 750 | 15 |
Table 6. Experimental results of the linear unbalanced data sets.

| Data Set | Method | SVM Acc | SVM Mac-P | SVM Mac-F1 | RF Acc | RF Mac-P | RF Mac-F1 | GBDT Acc | GBDT Mac-P | GBDT Mac-F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| line-1 | None | 0.8675 | 0.8896 | 0.8484 | 0.7726 | 0.8122 | 0.7256 | 0.8126 | 0.8243 | 0.7842 |
| line-1 | SMOTE | 0.9045 | 0.9148 | 0.8997 | 0.8555 | 0.8685 | 0.8471 | 0.8774 | 0.8849 | 0.8731 |
| line-1 | LR-SMOTE | 0.9065 | 0.9161 | 0.9012 | 0.8339 | 0.8528 | 0.8182 | 0.8662 | 0.8746 | 0.8591 |
| line-1 | MR-SMOTE | 0.9206 | 0.9233 | 0.9186 | 0.8678 | 0.8730 | 0.8643 | 0.8739 | 0.8773 | 0.8698 |
| line-2 | None | 0.8733 | 0.8945 | 0.8607 | 0.7668 | 0.8075 | 0.7243 | 0.8209 | 0.8412 | 0.8011 |
| line-2 | SMOTE | 0.8891 | 0.9062 | 0.8836 | 0.8548 | 0.8629 | 0.8501 | 0.8626 | 0.8713 | 0.8566 |
| line-2 | LR-SMOTE | 0.8923 | 0.9059 | 0.8865 | 0.8354 | 0.8497 | 0.8271 | 0.8588 | 0.8685 | 0.8520 |
| line-2 | MR-SMOTE | 0.9139 | 0.9160 | 0.9131 | 0.8675 | 0.8702 | 0.8657 | 0.8733 | 0.8780 | 0.8698 |
| line-3 | None | 0.8754 | 0.8890 | 0.8644 | 0.7920 | 0.8261 | 0.7583 | 0.8344 | 0.8496 | 0.8180 |
| line-3 | SMOTE | 0.8995 | 0.9069 | 0.8954 | 0.8646 | 0.8726 | 0.8618 | 0.8748 | 0.8782 | 0.8716 |
| line-3 | LR-SMOTE | 0.8988 | 0.9058 | 0.8947 | 0.8464 | 0.8575 | 0.8415 | 0.8683 | 0.8730 | 0.8639 |
| line-3 | MR-SMOTE | 0.9175 | 0.9183 | 0.9168 | 0.8691 | 0.8720 | 0.8679 | 0.8803 | 0.8807 | 0.8784 |
Table 7. Experimental results of the step unbalanced data sets.

| Data Set | Method | SVM Acc | SVM Mac-P | SVM Mac-F1 | RF Acc | RF Mac-P | RF Mac-F1 | GBDT Acc | GBDT Mac-P | GBDT Mac-F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| stage-1 | None | 0.7403 | 0.8207 | 0.7066 | 0.6144 | 0.7610 | 0.5051 | 0.6793 | 0.7553 | 0.6194 |
| stage-1 | SMOTE | 0.8418 | 0.8685 | 0.8332 | 0.7614 | 0.8166 | 0.7447 | 0.8250 | 0.8487 | 0.8182 |
| stage-1 | LR-SMOTE | 0.8566 | 0.8789 | 0.8512 | 0.7103 | 0.7950 | 0.6729 | 0.8021 | 0.8403 | 0.7915 |
| stage-1 | MR-SMOTE | 0.9039 | 0.9062 | 0.9023 | 0.8440 | 0.8592 | 0.8408 | 0.8596 | 0.8706 | 0.8561 |
| stage-2 | None | 0.7746 | 0.8365 | 0.7528 | 0.6398 | 0.7694 | 0.5538 | 0.7193 | 0.7936 | 0.6838 |
| stage-2 | SMOTE | 0.8575 | 0.8760 | 0.8525 | 0.7790 | 0.8242 | 0.7682 | 0.8368 | 0.8551 | 0.8330 |
| stage-2 | LR-SMOTE | 0.8649 | 0.8833 | 0.8602 | 0.7429 | 0.8073 | 0.7202 | 0.8205 | 0.8481 | 0.8142 |
| stage-2 | MR-SMOTE | 0.9064 | 0.9078 | 0.9051 | 0.8380 | 0.8529 | 0.8357 | 0.8621 | 0.8723 | 0.8601 |
| stage-3 | None | 0.6465 | 0.8034 | 0.6369 | 0.4534 | 0.7669 | 0.3607 | 0.5651 | 0.6999 | 0.5259 |
| stage-3 | SMOTE | 0.7828 | 0.8481 | 0.7847 | 0.7390 | 0.8181 | 0.7336 | 0.8048 | 0.8297 | 0.8022 |
| stage-3 | LR-SMOTE | 0.8118 | 0.8546 | 0.8116 | 0.6766 | 0.8001 | 0.6654 | 0.7641 | 0.8044 | 0.7583 |
| stage-3 | MR-SMOTE | 0.8771 | 0.8826 | 0.8759 | 0.8163 | 0.8366 | 0.8151 | 0.8351 | 0.8431 | 0.8327 |
| stage-4 | None | 0.6823 | 0.8124 | 0.6767 | 0.5186 | 0.7697 | 0.4671 | 0.6491 | 0.7391 | 0.6334 |
| stage-4 | SMOTE | 0.8221 | 0.8606 | 0.8223 | 0.7670 | 0.8119 | 0.7634 | 0.8098 | 0.8315 | 0.8082 |
| stage-4 | LR-SMOTE | 0.8440 | 0.8700 | 0.8434 | 0.7186 | 0.7957 | 0.7131 | 0.7871 | 0.8205 | 0.7848 |
| stage-4 | MR-SMOTE | 0.8766 | 0.8829 | 0.8762 | 0.8135 | 0.8278 | 0.8119 | 0.8436 | 0.8513 | 0.8425 |
Table 8. Presmall on the data sets.

| Data Set | Method | SVM Presmall | RF Presmall | GBDT Presmall |
| --- | --- | --- | --- | --- |
| line-1 | None | 0.277 | 0.048 | 0.184 |
| line-1 | SMOTE | 0.563 | 0.472 | 0.588 |
| line-1 | LR-SMOTE | 0.553 | 0.338 | 0.508 |
| line-1 | MR-SMOTE | 0.703 | 0.625 | 0.603 |
| line-2 | None | 0.358 | 0.052 | 0.261 |
| line-2 | SMOTE | 0.427 | 0.554 | 0.519 |
| line-2 | LR-SMOTE | 0.501 | 0.449 | 0.433 |
| line-2 | MR-SMOTE | 0.768 | 0.681 | 0.612 |
| line-3 | None | 0.403 | 0.110 | 0.308 |
| line-3 | SMOTE | 0.583 | 0.607 | 0.618 |
| line-3 | LR-SMOTE | 0.579 | 0.525 | 0.565 |
| line-3 | MR-SMOTE | 0.791 | 0.701 | 0.681 |
| stage-1 | None | 0.329 | 0.039 | 0.154 |
| stage-1 | SMOTE | 0.555 | 0.484 | 0.545 |
| stage-1 | LR-SMOTE | 0.584 | 0.322 | 0.506 |
| stage-1 | MR-SMOTE | 0.781 | 0.615 | 0.619 |
| stage-2 | None | 0.403 | 0.062 | 0.259 |
| stage-2 | SMOTE | 0.616 | 0.506 | 0.637 |
| stage-2 | LR-SMOTE | 0.606 | 0.360 | 0.555 |
| stage-2 | MR-SMOTE | 0.791 | 0.695 | 0.690 |
| stage-3 | None | 0.445 | 0.042 | 0.368 |
| stage-3 | SMOTE | 0.657 | 0.738 | 0.697 |
| stage-3 | LR-SMOTE | 0.662 | 0.605 | 0.616 |
| stage-3 | MR-SMOTE | 0.768 | 0.738 | 0.708 |
| stage-4 | None | 0.476 | 0.164 | 0.420 |
| stage-4 | SMOTE | 0.732 | 0.745 | 0.697 |
| stage-4 | LR-SMOTE | 0.754 | 0.648 | 0.705 |
| stage-4 | MR-SMOTE | 0.783 | 0.760 | 0.778 |