A Novel End-To-End Feature Selection and Diagnosis Method for Rotating Machinery

Feature selection, the task of obtaining effective features from data, is also known as feature engineering. Traditionally, feature selection and predictive model learning are performed separately, which leads to inconsistent criteria between the two stages. This paper presents an end-to-end feature selection and diagnosis method that organically unifies feature expression learning and machine prediction learning in one model. The algorithm first uses the prediction model to calculate the mean impact values (MIVs) of the features and realizes primary, prediction-oriented feature selection by retaining the features with the largest MIVs. To also account for the discriminative power of the features themselves, a within-class and between-class discriminant analysis (WBDA) method is proposed and combined with a feature diversity strategy to realize a secondary, feature-oriented selection. Finally, the feature vectors obtained by the two selections are classified using a multi-class support vector machine (SVM). Compared with the modified network variable selection algorithm (MIV), principal component analysis (PCA), variable selection based on compensative distance evaluation technology (CDET), and other algorithms, the proposed MIVs-WBDA method exhibits excellent classification accuracy owing to the fusion of feature selection and predictive model learning. In classification accuracy tests on rotating machinery status after dimensionality reduction, the MIVs-WBDA method achieves a 3% accuracy improvement on the low-dimensional feature set. The typical running time of the proposed classification learning algorithm is less than 10 s, whereas a deep learning counterpart would require several hours.


Introduction
Bearings are among the most easily damaged parts of rotating machinery, and approximately 50% of motor faults are bearing related [1,2]. Machinery running noise is a type of mechanical wave that carries a wealth of information about machine status and propagates energy to the surrounding environment through vibration [3,4]. Both noise and vibration are caused by elastic deformations of the rotor; therefore, machinery running noise is as good an indicator as the vibration signal [3,5]. Compared with vibration diagnostics, noise diagnostics offer non-contact measurement, convenient sensor installation, no influence on machinery operation, and online monitoring. Noise diagnostics are especially suitable for situations where the vibration signal is not easy to measure [4]. This paper studies a rotating machinery fault diagnosis method based on noise signals.
Rotating machinery noise diagnosis identifies machinery working conditions by monitoring the elastic waves induced by deformations, exfoliations, or cracks. Fault diagnosis can be regarded as a pattern recognition problem, and artificial intelligence (AI) has attracted great attention and shows promise in rotating machinery fault recognition applications [6]. AI-based rotating machinery fault diagnosis includes sensing, data acquisition, feature extraction, dimensionality reduction, and fault classification. Among these, feature extraction and dimensionality reduction are the most critical steps in the workflow [7], as they determine the upper limit of the fault identification accuracy of the subsequent classification algorithm. Too much redundant information in high-dimensional feature vectors may lead to the curse of dimensionality and increased calculation time. The principle of selection is to avoid missing any feature that may be useful while not including too many features. To extract features, many signal processing methods have been used in rotating machine health monitoring and diagnosis, such as time-domain and frequency-domain feature parameters [8][9][10], the discrete wavelet transform (DWT) [11], empirical mode decomposition (EMD) [12], time-frequency analysis (TFA) [13], the Mel-frequency cepstrum (MFC) [14], and Shannon entropy [15]. Among them, Shannon entropy features have been widely used in machine health monitoring recently. For example, the instantaneous energy distribution-permutation entropy (IED-PE) [16], the improved multiscale dispersion entropy (IMDE) [17], the composite multi-scale weighted permutation entropy (CMWPE) [18], the stationary wavelet packet Fourier entropy (SWPFE) [19], and similarity-fuzzy entropy [20] have been proposed to construct sensitive features for rolling bearing health monitoring.
However, the construction of good sensitive features requires manual experience, which is known as the feature engineering problem. With the application of deep learning, some feature self-encoding methods have been adopted [21]. However, the difficulty of deep learning lies in evaluating the contribution of representation learning to the final system output. At present, a more effective approach is to use the final output layer for predictive learning and the other layers for representation learning.
Feature selection chooses an effective subset of the original feature set so that a model trained on this subset achieves the highest accuracy. A direct feature selection algorithm is a subset search algorithm, and a commonly used method is to adopt a greedy strategy, such as forward search or reverse search. Subset search algorithms are divided into two types: filter and wrapper. The filter method is a feature selection method that does not depend on a specific machine learning model, while the wrapper method uses the accuracy of the subsequent machine learning model as the feature selection criterion. Another form of feature learning is feature extraction, which projects the original features into a new space to obtain a new feature representation, as in principal component analysis (PCA) and the auto-encoder. Among existing feature selection and feature extraction algorithms, PCA transforms the original data into linearly independent data via a linear transformation, and it can be used to extract the main feature components of the data [22]. PCA expands features along the directions of largest variance, so the obtained low-dimensional features have no corresponding physical meaning. Chen B. et al. achieved selection and dimensionality reduction of intrinsic mode function (IMF) components of motor bearings via distance evaluation technology (DET) and used the dimensionality-reduced feature vectors as inputs to a support vector machine (SVM) [23]. Lei et al. proposed compensative distance evaluation technology (CDET) with enhanced dimensionality reduction performance and applied it to feature dimensionality reduction of bearing vibration signals [24]. CDET selects the features with the smallest within-cluster distance and the largest between-cluster distance. PCA, DET, and CDET do not consider the characteristics of the classification network. Melih Kuncan et al. proposed a feature extraction method based on one-dimensional ternary patterns (1D-TP) obtained from comparisons between neighbors of each value of the vibration signal for bearing fault classification [25]. To address variable redundancy and model complexity in prediction models, Xu et al. combined a neural network with the mean impact value (MIV) for wind power prediction [26]. In addition, methods based on decision trees or GBDT for feature extraction or dimensionality reduction have been used in machinery diagnostics. Madhusudana et al. used the decision tree technique to select prominent features out of all extracted features [27]. Li et al. proposed a wrapped feature selection algorithm based on XGBoost, which uses the importance measure of XGBoost as a feature subset search heuristic, and verified it on 8 data sets [28]. Addressing the variable working conditions of rotating equipment, Wu et al. proposed a deep autoencoder feature learning method and applied it to fault diagnosis of rotating equipment [29].
In terms of feature classification, neural networks [30,31] and the SVM [32,33] have been widely applied in machinery diagnosis. Han et al. compared the performance of random forests, artificial neural networks, and SVM methods in the intelligent diagnosis of rotating equipment [34]. Hu et al. utilized the wavelet packet transform and SVM ensemble technology for fault diagnosis [35]. Liu et al. proposed a genetic algorithm (GA)-based self-adaptive resonance demodulation technique [36]. Zhu et al. proposed a fault diagnosis method based on an SVM optimized by the GA [37]. Han et al. combined EMD, a particle swarm optimization SVM (PSO-SVM), and fractal box dimensions for gear fault feature extraction and fault classification [38]. Indeed, heuristic search methods, such as the GA, simulated annealing [39], and tabu search [40], have also been applied in feature classification. In addition, ensemble learning and deep neural networks are widely used in fault diagnosis [41]. Zhou et al. proposed a novel bearing diagnosis method based on ensemble empirical mode decomposition (EEMD) and weighted PE and further enhanced the classification accuracy through a mixed voting strategy and a similarity criterion [42]. Addressing the problem of big data analysis, Wu et al. proposed a two-stage big data analytics framework and achieved a high level of classification accuracy [43].
Conventional rotating machinery diagnosis algorithms ignore the complementarity of the feature selection algorithm and the classification network during feature selection. To this end, this paper proposes an end-to-end feature selection and diagnosis method that organically unifies feature expression learning and machine prediction learning in one model, realizing a compromise between the two types of algorithms, and applies it to machinery state classification. First, based on the modified MIVs algorithm, our method not only selects features of noise signals according to the contributions of the independent variables to the classification network but also solves the randomness problem of the MIV values. By eliminating the features that have little influence on classification, this step realizes the primary, classification-network-oriented feature selection. Second, in order to characterize the metric ability of the features themselves, a new WBDA algorithm with between-class sorting is introduced for the within-class and between-class aggregation calculation, and a feature diversity selection strategy is proposed to prevent the WBDA values of features in the same category from all being relatively large. Experimental results show that this feature diversity selection strategy can effectively improve the accuracy of the algorithm. Thus, secondary selection of features is achieved through feature divisibility. Since there are few faulty data in industrial applications, it is hoped that the diagnosis algorithm can run online. The classification network uses an SVM to compute the actual classification accuracy and removes local optimal solutions through the Monte Carlo method. The present paper compares the proposed algorithm with the MIV network variable selection algorithm, the CDET variable selection algorithm, and the PCA variable dimensionality reduction algorithm.
After selecting features of the same dimension, the proposed algorithm is found to have better classification accuracy than the other methods, which verifies its superiority. This paper is organized as follows. Section 1 introduces the background, motivation, and a brief literature review of feature learning and feature classification. Section 2 constructs the machinery noise feature set used for testing in Section 6. In Section 3, a bearing noise diagnosis algorithm based on network variable selection and WBDA, named MIVs-WBDA, is proposed. Since feature classification is performed by an SVM, Section 4 introduces two classifier parameter optimization algorithms for the SVM: the PSO algorithm and the GA. Section 5 summarizes the procedure of MIVs-WBDA. Section 6 describes the simulation testing. Finally, Section 7 presents our conclusions and some further remarks.

Feature Extraction
In practical applications, it is difficult to determine in advance which features are the key ones, and classifiers based on different features may perform significantly differently. For the application in this paper, in order to verify whether the proposed feature selection algorithm can select the most suitable features from an undetermined feature set, a large number of features used in the previous literature are constructed as the candidate feature set. These features form a feature pool. As a test, a total of 31 features were constructed in this article, divided into 6 classes, as shown in Figure 1.

Traditional Time Domain Feature Set
Traditional time-domain and statistical features are powerful tools for characterizing the changes in bearing vibration signals when faults occur [44]. Time-domain characteristics can be obtained directly from the monitoring signal and reflect the change of energy amplitude on the time scale of the signal; they are common indices for rapid diagnosis. This paper uses the 11 features shown in Table 1. Herein, x_i refers to the i-th measurement of the time-domain signal, s_i refers to the i-th frequency-domain value based on the short-time Fourier transform (STFT), and the x_i are sorted in ascending order where required, with N (an even number) the number of samples and subscript i ranging from 1 to N. F_j (j = 1, ..., 11) refers to the j-th feature of the signal, µ is the mean of signal x, and σ is its standard deviation. These features are calculated for every short-time frame of the bearing noise signal.

Table 1. Traditional time-domain feature set (F1-F11: feature name and definition). The full table did not survive extraction; F1 is the mean, and the set also includes the number of zero-crossing points.
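Since the definitions in Table 1 did not all survive extraction, the following minimal sketch computes a set of commonly used time-domain features; the exact set and names here are illustrative assumptions drawn from standard practice (the paper does mention the mean, the zero-crossing count, and the margin factor), not a reproduction of Table 1.

```python
import numpy as np

def time_domain_features(x):
    """Commonly used time-domain features of one signal frame.
    The exact 11-feature set of Table 1 is not reproduced; this is an
    illustrative subset following standard bearing-diagnosis practice."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()                                      # mean
    sigma = x.std()                                    # standard deviation
    rms = np.sqrt(np.mean(x ** 2))                     # root mean square
    peak = np.max(np.abs(x))                           # peak amplitude
    skew = np.mean((x - mu) ** 3) / sigma ** 3         # skewness
    kurt = np.mean((x - mu) ** 4) / sigma ** 4         # kurtosis
    crest = peak / rms                                 # crest factor
    margin = peak / np.mean(np.sqrt(np.abs(x))) ** 2   # margin factor
    zc = int(np.sum(np.diff(np.sign(x)) != 0))         # zero crossings
    return {"mean": mu, "std": sigma, "rms": rms, "peak": peak,
            "skewness": skew, "kurtosis": kurt, "crest": crest,
            "margin": margin, "zero_crossings": zc}
```

These statistics are computed per short-time frame, matching the framing described above.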

Empirical Mode Decomposition Energy Entropy
Features 12 to 17 are empirical mode decomposition energy entropies. EMD is a signal analysis method proposed by Dr. Huang in 1998 [45]. It is an adaptive data processing or mining method that is very suitable for processing nonlinear and non-stationary time series. The EMD feature extraction procedure is as follows:
(a) Decompose the bearing noise signal into a set of IMFs.
(b) Calculate the energy of each IMF.
(c) Calculate the energy entropy of each IMF.
(d) Calculate the energy entropy of the whole original signal.
(e) Construct the feature vector from EMD_entropy and the first five IMF_entropy(j) values:
[F12, F13, ..., F17] = [EMD_entropy, IMF_entropy(1), ..., IMF_entropy(5)]
Figure 2 shows the empirical mode decomposition of a sample.
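Steps (b)-(d) can be sketched as below. The EMD sifting in step (a) is assumed to be provided by an external library (e.g., PyEMD) and is not reimplemented here, so the function takes precomputed IMFs; the per-IMF entropy definition is one plausible reading of step (c), namely the Shannon entropy of each IMF's normalized sample-energy distribution.

```python
import numpy as np

def emd_energy_entropy(imfs):
    """Energy entropy over a set of IMFs (one IMF per row): steps (b) and
    (d). The energies of the IMFs are normalized into a distribution whose
    Shannon entropy is EMD_entropy."""
    E = np.array([np.sum(imf ** 2) for imf in imfs])  # energy of each IMF
    p = E / E.sum()                                   # energy distribution
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))              # EMD_entropy

def imf_entropy(imf):
    """One plausible per-IMF entropy (step (c), an assumption here):
    Shannon entropy of the IMF's normalized per-sample energy."""
    e = np.asarray(imf, dtype=float) ** 2
    p = e / e.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))
```

For example, two IMFs of equal energy give EMD_entropy = ln 2.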

Permutation Entropy
Feature 18 is permutation entropy (PE). The permutation entropy algorithm is a method for detecting abrupt changes in vibration signals; it can conveniently locate the mutation time of the system and can detect small changes in the signal.
The calculation steps of PE are as follows:
(1) Let x_j (j = 1, 2, ..., N) be a time series of length N, and define an embedding dimension m and a time delay d.
(2) The signal is reconstructed in phase space to obtain k = N - (m - 1)d reconstructed components, each represented by X_i = {x_i, x_{i+d}, ..., x_{i+(m-1)d}}, i = 1, 2, ..., k.
(3) The elements of each subsequence X_i are sorted in ascending order, that is, x_{i+(j_1-1)d} <= x_{i+(j_2-1)d} <= ... <= x_{i+(j_m-1)d}. When sorting, if two values are equal, they are ordered according to the subscript index j_n. In this way, each X_i is mapped to an ordinal pattern pi_j = (j_1, j_2, ..., j_m), one of the possible orderings of m numbers; thus every m-dimensional subsequence X_i is mapped to one of m! permutations.
(4) Count the number of times each permutation pattern pi_j appears, denoted f(pi_j). The probability of each permutation pattern is then P_j = f(pi_j) / k.
(5) The permutation entropy of the time series is defined as H_p(m) = -sum_j P_j ln P_j. Obviously, 0 <= H_p(m) <= ln(m!). In general, H_p(m) is normalized to the range 0-1 as H_p = H_p(m) / ln(m!).
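The five steps above can be sketched compactly; ties are broken by index via a stable sort, matching step (3).

```python
import numpy as np
from math import factorial, log

def permutation_entropy(x, m=3, d=1, normalize=True):
    """Permutation entropy following steps (1)-(5): phase-space
    reconstruction, ordinal-pattern counting, Shannon entropy."""
    x = np.asarray(x, dtype=float)
    k = len(x) - (m - 1) * d          # number of reconstructed components
    patterns = {}
    for i in range(k):
        window = x[i:i + (m - 1) * d + 1:d]
        # stable argsort breaks ties by subscript, as in step (3)
        pi = tuple(np.argsort(window, kind="stable"))
        patterns[pi] = patterns.get(pi, 0) + 1
    p = np.array(list(patterns.values()), dtype=float) / k
    h = float(-np.sum(p * np.log(p)))
    return h / log(factorial(m)) if normalize else h
```

A monotonic series produces a single ordinal pattern and hence zero entropy, while white noise approaches the normalized maximum of 1.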

Dispersion Entropy
Feature 19 is dispersion entropy (DE). Rostaghi et al. [46] gave the detailed calculation steps of DE as follows. For a given univariate signal of length N, x = {x_1, x_2, ..., x_N}, the DE algorithm has four main steps:
(1) First, the x_j (j = 1, 2, ..., N) are mapped to c classes, labeled from 1 to c. There are a number of linear and nonlinear ways to do this, and the linear mapping is the fastest; however, when the maximum and/or minimum values of a time series are much larger or smaller than its mean/median value, the majority of the x_j are assigned to only a few classes. Thus, we first employ the normal cumulative distribution function (NCDF) to map x into y = {y_1, y_2, ..., y_N} with values from 0 to 1. Next, we use a linear rule to assign each y_j to an integer from 1 to c: z_j^c = round(c * y_j + 0.5), where z_j^c denotes the j-th member of the classified time series and rounding maps a number to the nearest integer. This step could also be done with other linear or nonlinear mapping techniques.
(2) Each embedding vector z_i^{m,c} with embedding dimension m and time delay d is z_i^{m,c} = {z_i^c, z_{i+d}^c, ..., z_{i+(m-1)d}^c}, i = 1, 2, ..., N - (m - 1)d. The number of possible dispersion patterns that can be assigned to each vector z_i^{m,c} is c^m, since the vector has m members and each member can be one of the integers from 1 to c.
(3) For each of the c^m potential dispersion patterns pi_{v0 v1 ... v(m-1)}, the relative frequency is p(pi_{v0 v1 ... v(m-1)}) = #{i : z_i^{m,c} has pattern pi_{v0 v1 ... v(m-1)}} / (N - (m - 1)d); that is, the number of embedding vectors assigned to the pattern, divided by the total number of embedding vectors with embedding dimension m.
(4) Finally, based on Shannon's definition of entropy, the DE value with embedding dimension m, time delay d, and number of classes c is DE(x, m, c, d) = -sum p(pi_{v0 v1 ... v(m-1)}) ln p(pi_{v0 v1 ... v(m-1)}).
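The four steps can be sketched as follows; `math.erf` supplies the NCDF so that no external statistics library is needed.

```python
import numpy as np
from math import erf, sqrt

def dispersion_entropy(x, m=2, c=3, d=1):
    """Dispersion entropy following steps (1)-(4): NCDF mapping, linear
    assignment to c classes, dispersion-pattern counting, Shannon entropy."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu, sigma = x.mean(), x.std()
    # step (1): NCDF maps x into (0, 1), then linear mapping to 1..c
    y = 0.5 * (1.0 + np.array([erf((v - mu) / (sigma * sqrt(2))) for v in x]))
    z = np.clip(np.round(c * y + 0.5).astype(int), 1, c)
    # step (2): embedding vectors; step (3): relative pattern frequencies
    k = n - (m - 1) * d
    counts = {}
    for i in range(k):
        pattern = tuple(z[i:i + (m - 1) * d + 1:d])
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values()), dtype=float) / k
    # step (4): Shannon entropy over the observed dispersion patterns
    return float(-np.sum(p * np.log(p)))
```

The result is bounded above by ln(c^m), attained when all dispersion patterns are equally likely.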

Wavelet Packet Decomposition
Features 20 to 27 are the norms of the signals reconstructed from the wavelet packet decomposition coefficients. Wavelet decomposition expands the signal on a series of wavelet basis functions.
In engineering applications, useful signals usually appear in the low-frequency part or as relatively stable components, while interference usually appears at high frequencies. Therefore, the signal can be approximated with a small amount of data by the low-frequency coefficients together with a few layers of high-frequency coefficients. Figure 3 shows a three-layer decomposition structure, where cA_ij and cD_ij (i = 1, 2, 3; 1 <= j <= 2^{i-1}) are the low-frequency and high-frequency decomposition coefficients of the corresponding layer. Feature extraction based on wavelet decomposition proceeds as follows:
(1) Wavelet packet decomposition of the one-dimensional signal. Select the db1 wavelet, set the decomposition level to 3, and perform a 3-level wavelet packet decomposition of signal x.
(2) Perform wavelet reconstruction of the decomposed coefficients. A one-dimensional wavelet reconstruction is performed from the low-frequency coefficients of the N-th layer of the wavelet decomposition and the high-frequency coefficients of the first to N-th layers.
(3) Calculate the 2-norms of the reconstructed signals and use them as features F20-F27.
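A minimal numpy-only sketch of steps (1) and (3) is shown below, implementing the db1 (Haar) filter bank directly rather than calling a wavelet library such as PyWavelets; because the Haar transform is orthonormal, the coefficient norms of the 8 terminal subbands equal the norms of the corresponding reconstructed subband signals, so the explicit reconstruction of step (2) can be skipped in this sketch.

```python
import numpy as np

def haar_wp_norms(x, levels=3):
    """3-level wavelet packet decomposition with the db1 (Haar) filters,
    returning the 2-norm of each of the 2**levels terminal subbands
    (features F20-F27)."""
    s = 1.0 / np.sqrt(2.0)
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        nxt = []
        for a in nodes:
            if len(a) % 2:                   # pad odd-length node to even
                a = np.append(a, 0.0)
            low = s * (a[0::2] + a[1::2])    # approximation coefficients
            high = s * (a[0::2] - a[1::2])   # detail coefficients
            nxt.extend([low, high])
        nodes = nxt
    return [float(np.linalg.norm(a)) for a in nodes]
```

For a signal whose length is a power of two, the orthonormality of the transform means the subband energies sum exactly to the signal energy.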

Frequency Domain Feature Set
The frequency-domain features are the sum of the spectrum amplitudes, the mean of the spectrum, the standard deviation of the spectrum, and the integral of the frequency-domain curve, denoted F28-F31, respectively.
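These four features can be sketched as follows; the use of the magnitude of the real FFT and the trapezoidal rule for the integral are assumptions, as the text does not specify them.

```python
import numpy as np

def frequency_domain_features(x, fs=44100):
    """F28-F31 as described: spectrum amplitude sum, spectrum mean,
    spectrum standard deviation, and the integral of the frequency-domain
    curve (trapezoidal rule here is an assumption)."""
    spec = np.abs(np.fft.rfft(x))                 # one-sided amplitude spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)   # frequency axis in Hz
    f28 = float(spec.sum())                       # sum of spectrum amplitudes
    f29 = float(spec.mean())                      # mean of the spectrum
    f30 = float(spec.std())                       # std. dev. of the spectrum
    # trapezoidal integral of the spectrum over frequency
    f31 = float(np.sum((spec[1:] + spec[:-1]) * np.diff(freqs)) / 2.0)
    return f28, f29, f30, f31
```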

Feature Selection Algorithm for Rotating Machinery Noise Diagnosis
The rotating machinery noise diagnosis process generally includes three steps: feature extraction, feature selection (or feature dimension reduction), and state classification.
Traditional feature selection separates the data from the classification and maps the original features, by dimensionality reduction, into the several features selected by the algorithm. Table 2 summarizes the characteristics and differences of commonly used feature filtering algorithms; each processes the data with its own focus.
Aiming at the problem that the traditional feature selection is usually separated from the learning of prediction model for rotating machinery noise diagnosis, this paper proposes a feature selection algorithm based on network variable selection and within-class and between-class discriminant analysis (WBDA). The proposed algorithm realizes the compromise between the two types of feature selection technique, as shown in Figure 4.

Table 2. Commonly used feature selection methods and their characteristics.
Feature Selection | Characteristics
PCA | Transforms the original data into linearly independent data via a linear transformation.
Probabilistic PCA (PPCA) | PCA does not consider the probability distribution of the data; PPCA gives PCA a probabilistic interpretation and extends the PCA algorithm.
Autoencoder | A deep learning method that maps data to a low-dimensional feature space in an unsupervised manner.

Primary Feature Selection Oriented to the Classification Network-MIVs-SVM
The selection of meaningful time-frequency features of noise as SVM input is a key step for status prediction. The MIV is considered one of the most effective indexes for evaluating the influence of variables on the output of a neural network. However, when a neural network is used as the classification network to calculate the MIV of a feature variable, the calculated MIVs show great randomness, because the parameters of the neural network obtained by each training run are not the same. Figure 5 illustrates this randomness when a neural network is used to calculate the MIV; the abscissa is the feature index, and the ordinate is the MIV.
Since an SVM is used for fault classification, this algorithm uses the SVM network to calculate the MIV, and is named MIVs-SVM. Note that the final output of the SVM is the class a sample belongs to rather than a continuous value. Therefore, after the SVM classification hyperplane is obtained by training, this paper first estimates the posterior probability P(y_i = c | x_i), c in {1, 2, ..., N}, of sample x_i belonging to each class c using the softmax regression function, and then selects the probability corresponding to the real class of sample x_i as the output result. The specific calculation, illustrated in Figure 6, is as follows:
(a) After network training, each feature variable in the training sample set P is increased and decreased by 10% to obtain training sample sets P1 and P2, respectively. P1 and P2 are input into the established network, and the softmax regression function is applied to the SVM outputs; the two new classification results are denoted A1 and A2.
(b) The difference between A1 and A2 is taken as the impact value (IV) of the independent variable's variation on the output.
(c) The IVs are averaged over all samples to obtain the mean impact value (MIV) of that variable.
(d) Steps (a)-(c) are repeated for each feature variable.
(e) The effect of each independent variable on the output is evaluated by its absolute MIV, and the influence of each input feature on the results is thereby assessed, achieving variable selection.
Since this modified method directly uses the subsequent classification network SVM to calculate the MIV, it is called MIVs-SVM, abbreviated as MIVs.
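The MIV calculation above can be sketched generically as below; the trained classifier is abstracted as a `predict_proba` function (in the paper this is the SVM's scores passed through softmax regression), which is an assumption of this sketch.

```python
import numpy as np

def mean_impact_values(predict_proba, X, y, delta=0.10):
    """MIVs-SVM sketch: perturb each feature by +/-10 percent, push both
    perturbed sample sets through the trained classifier's class-probability
    output, and average the change in the probability assigned to each
    sample's true class. `predict_proba(X) -> (n_samples, n_classes)`."""
    n_samples, n_features = X.shape
    rows = np.arange(n_samples)
    mivs = np.empty(n_features)
    for j in range(n_features):
        X1, X2 = X.copy(), X.copy()
        X1[:, j] *= 1.0 + delta          # perturbed sample set P1
        X2[:, j] *= 1.0 - delta          # perturbed sample set P2
        a1 = predict_proba(X1)[rows, y]  # true-class probabilities, A1
        a2 = predict_proba(X2)[rows, y]  # true-class probabilities, A2
        mivs[j] = np.mean(a1 - a2)       # IVs averaged over samples -> MIV
    return mivs
```

Primary selection then keeps the features with the largest absolute MIVs; a feature the classifier ignores gets an MIV of zero.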

Secondary Feature Selection Based on Feature Divisibility-WBDA
The effects of the feature variables on the output were sorted based on network feature selection, which reflects the coupling of the feature selection and feature classification algorithms and provides a reference for variable selection oriented to the classification network. Nevertheless, to evaluate the divisibility of a feature, we want the feature values of samples in the same class to be as close as possible, while those of different classes are as far apart as possible. To this end, the idea of WBDA was introduced.
The idea of WBDA comes from linear discriminant analysis (LDA). The idea of LDA is simple: given a set of training samples, project the samples onto a straight line so that the projection points of samples of the same class are as close as possible and those of different classes are as far apart as possible. LDA is used for feature dimensionality reduction, so it must construct an optimal linear transformation W. Here, however, the purpose is feature selection, so the linear transformation can be omitted. The specific algorithm is described as follows.
For any feature x_k, define the within-class divergence J_w = sum_i S_i^2, where S_i^2 = (1/n_i) sum_{x in X_i} (x - mu_i)^2 is called the divergence of class X_i, n_i is the number of samples in X_i, and mu_i is the mean of the feature over X_i.
Define the between-class divergence J_b = sum_i (mu_i - mu)^2, where mu is the overall mean of the feature. A larger J_b and a smaller J_w are better; taking both into consideration, the objective function is defined as J = J_b / J_w. In order to prevent the phenomenon that the calculated WBDA values of features in the same feature category are all relatively large, so that the selected features lack diversity, this paper proposes a between-class selection strategy: select the feature with the maximum WBDA value from one category each time, then select the maximum among the remaining categories next time. Once a category has participated in a selection, it does not participate in the selection of subsequent features until all categories have been visited; feature selection then proceeds to the next cycle.
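A minimal sketch of the WBDA score for a single feature follows; the exact forms of S_i^2 and J_b are reconstructed from the LDA-style definitions above and should be read as one consistent interpretation rather than the paper's verbatim formulas.

```python
import numpy as np

def wbda_score(feature_values, labels):
    """WBDA for one feature: between-class divergence J_b over within-class
    divergence J_w, with S_i^2 the divergence of class X_i. Larger scores
    mean tighter classes that sit farther apart."""
    v = np.asarray(feature_values, dtype=float)
    labels = np.asarray(labels)
    mu = v.mean()
    jw, jb = 0.0, 0.0
    for c in np.unique(labels):
        vc = v[labels == c]
        jw += np.mean((vc - vc.mean()) ** 2)   # S_i^2, within-class spread
        jb += (vc.mean() - mu) ** 2            # between-class separation
    return jb / jw
```

A feature whose class means are well separated relative to its class spreads receives a much larger score than a noisy, overlapping feature.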

Classifier and Its Parameter Optimization
Feature classification is achieved using the SVM. The multi-class support vector machine is suitable for complex industrial environments: it requires relatively little hardware and has a stable classification effect and good generalization performance. Let the training set be T = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, where x_i is the i-th input and y_i in {-1, 1} is its corresponding output label. The SVM handles the nonlinear binary classification problem as follows [30]:
(1) Select an appropriate kernel function K(x_i, x_j) = Phi(x_i) . Phi(x_j) and an appropriate penalty parameter C > 0, and construct the constrained optimization problem
min_alpha (1/2) sum_i sum_j alpha_i alpha_j y_i y_j K(x_i, x_j) - sum_i alpha_i, subject to sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C,
where Phi(x) is the mapping function and Phi(x_i) . Phi(x_j) is the inner product of Phi(x_i) and Phi(x_j).
(2) Solve the optimization problem to obtain the optimal solution alpha* = (alpha_1*, alpha_2*, ..., alpha_m*).
(3) The weight vector is w* = sum_i alpha_i* y_i Phi(x_i), where w* cannot be directly and explicitly evaluated.
(4) Find all of the S support vectors (x_s, y_s) on the maximum-interval boundary and compute the bias b* = (1/S) sum_s [ y_s - sum_i alpha_i* y_i K(x_i, x_s) ],

and the classification decision function is f(x) = sign( sum_i alpha_i* y_i K(x_i, x) + b* ).
The kernel function is equivalent to transforming the original input space into a new feature space through the mapping function and learning the linear support vector machine from the training samples in the new feature space. Learning is implicitly done in the feature space. In practical applications, the choice of kernel function needs to be verified by experiments. The radial basis kernel function is chosen in this paper.
The performance of the SVM classifier is mainly affected by the penalty factor (C) and the kernel parameter (γ). The kernel parameter mainly reflects the complexity of the sample data in the high-dimensional space, while the penalty factor affects the generalization capability of the SVM by tuning the ratio of the confidence interval to the empirical risk in the feature space. Hence, optimizing SVM performance is usually converted into the optimal selection of the parameter pair (C, γ). Conventional optimization algorithms include the PSO algorithm and the GA.
PSO employs a swarm-based global search strategy with a speed-displacement model and involves no complicated genetic procedures. The unique memory capability of PSO allows dynamic tracking of the current search situation. Indeed, PSO can be regarded as a search by a swarm of m particles Z = {Z_1, Z_2, ..., Z_m} in an n-dimensional space, where the location of each particle Z_i = {z_i1, z_i2, ..., z_in} is a candidate solution. The best solution found by each particle is denoted p_id, and the best solution found by the whole swarm is denoted p_gd. The particle speeds are denoted V_i = {v_i1, v_i2, ..., v_in}, and the update rule for V_i given these two best solutions is as follows [38]:
v_id(t + 1) = w v_id(t) + η1 rand() (p_id - z_id(t)) + η2 rand() (p_gd - z_id(t)) (17)
z_id(t + 1) = z_id(t) + v_id(t + 1) (18)
where v_id(t + 1) refers to the speed of the i-th particle at the (t + 1)-th iteration in the d-th dimension, w is the inertia weight, η1 and η2 are acceleration constants, and rand() is a random number between 0 and 1. The GA is a parallel random search optimization approach that mimics biological evolution [42]. Individuals are chosen by selection, crossover, and mutation according to the chosen fitness function, so as to retain individuals with good fitness and exclude those with poor fitness. In this way, the new generation inherits information from the old generation and outperforms it. This process is repeated until the requirements are satisfied.
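Equations (17) and (18) can be sketched as a minimal PSO loop; here a generic objective over box bounds is minimized, whereas in the paper the objective is the cross-validation error as a function of (C, γ). The parameter defaults are illustrative assumptions.

```python
import numpy as np

def pso_minimize(f, bounds, n_particles=20, iters=50,
                 w=0.7, eta1=1.5, eta2=1.5, seed=0):
    """Minimal PSO per Eqs. (17)-(18): velocity update with inertia w and
    acceleration constants eta1/eta2, then position update, tracking each
    particle's best p_id and the swarm best p_gd."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    dim = len(bounds)
    z = rng.uniform(lo, hi, size=(n_particles, dim))   # positions Z_i
    v = np.zeros_like(z)                               # velocities V_i
    p_best = z.copy()
    p_val = np.array([f(x) for x in z])
    g = p_best[p_val.argmin()].copy()                  # swarm best p_gd
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = w * v + eta1 * r1 * (p_best - z) + eta2 * r2 * (g - z)  # Eq. (17)
        z = np.clip(z + v, lo, hi)                                   # Eq. (18)
        vals = np.array([f(x) for x in z])
        better = vals < p_val
        p_best[better], p_val[better] = z[better], vals[better]
        g = p_best[p_val.argmin()].copy()
    return g, float(p_val.min())
```

On a simple quadratic objective the swarm converges quickly to the minimizer, which is the behavior exploited when tuning (C, γ).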
In this paper, the classifier's network classification parameters are optimized using these two optimization algorithms.

Network Variable Selection and WBDA Fusion-Oriented Rotating Machinery Noise Diagnosis Algorithm
The network variable selection and WBDA fusion-oriented rotating machinery noise diagnosis algorithm (MIVs-WBDA algorithm) is a feature selection algorithm that combines network variable selection with WBDA. First, features are selected according to the contributions of the independent variables to the classification network, achieving classification-network-oriented primary variable selection. Then, secondary feature selection and dimensionality reduction are carried out according to the WBDA, which reflects divisibility, before SVM identification. The steps are as follows: (1) From the calculated data feature set, samples are randomly divided into training samples, cross-validation samples, and testing samples. Cross-validation is a statistical analysis method for validating classifier performance, and experimental results demonstrated that SVM training with parameters selected on the cross-validation set was more effective than training with randomly selected parameters. Therefore, the feature MIVs are calculated on the cross-validation samples.
(2) After setting aside the N features with significant MIVs and excluding the features with negligible MIVs, the remaining features are arranged in descending order of between-class WBDA. According to the dimensionality after reduction (L), a new feature vector is formed from the first L - N of these features together with the N features with significant MIVs.
(3) According to the SVM optimization algorithm, the (C, γ) of the SVM was optimized using the cross-validation set.
(4) We conduct learning on the training set and test the identification accuracy of the current SVM. Figure 7 shows the MIVs-WBDA algorithm flow and the relationship between the two feature selection algorithms and the other modules. The result of primary feature selection is controlled by the classifier type, while secondary feature selection is conducted on the residual feature set according to the characteristics of the features themselves; the feature metric chosen for secondary selection is the WBDA defined in this paper. Together, these form a feature selection algorithm fusing network variable selection and WBDA. The superiority of this method is demonstrated in Section 6.

Testing Data
In this experiment, the Machinery Fault Simulator MFS-MG2010 was taken as the research object; its mechanical structure is shown in Figure 8, and the specific instrument details are listed in Table 3. The pickup is installed on a moving trolley so that one device can monitor multiple machines. The faulty bearings come from a 1-inch rolling bearing standard fault kit that includes an inner race fault bearing, an outer race fault bearing, a ball fault bearing, and a combined fault bearing; the combined fault bearing combines three fault types: inner race fault, outer race fault, and ball fault. The vibration features can be greatly affected by the fault edge profiles [47]. Figure 9 is a physical map of the three fault types, and Figure 10 shows the experimental environment and some of the testing instruments. Each fault is a small round hole with a diameter of 2-3 mm and a depth of about 0.5 mm in the testing bearings. The noise signals of five modes (normal, inner race fault, outer race fault, ball fault, and combined fault), at a motor speed of 1800 rpm and a sampling frequency of 44.1 kHz, were obtained through the pickup and are shown in Figure 10. The x-axis is the number of sampling points, and the y-axis is the signal amplitude. Since we only focus on the relative trend of the signal amplitude over time and not on its actual size, the y-axis unit is not marked in the figure; this is common practice [42,48]. In this study, 720 training sets, 360 cross-validation sets, and 120 testing sets were generated randomly in Matlab. Figure 11 shows the average absolute MIVs and WBDA of samples in the five clusters with the cross-validation set as the feature set. From Figure 11, F10 is significantly different from the other features (and can be directly selected), but the remaining features have similar MIVs; thus, it is not persuasive to evaluate features with similar MIVs according to the network contributions alone.
Therefore, the WBDA of the features in the five clusters is calculated using the cross-validation set. The results are shown in Figure 12. According to the between-class selection strategy, based on the WBDA value, the order of feature selection is F24, F29, F14, F10, F19, F18, F22, F28, F3, . . .
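As a rough illustration of how a within-/between-class discriminant score ranks features, a Fisher-ratio-style computation per feature might look like the following. This is only a stand-in: the paper's WBDA metric is defined earlier in the paper and differs in detail.

```python
import numpy as np

def discriminant_score(x, y):
    """Between-class over within-class variance ratio for one feature.

    Illustrative Fisher-style score only; the paper's actual WBDA
    definition should be used in practice.
    """
    classes = np.unique(y)
    overall = x.mean()
    between = sum((x[y == c].mean() - overall) ** 2 for c in classes)
    within = sum(x[y == c].var() for c in classes)
    return between / (within + 1e-12)  # small epsilon guards division by zero

rng = np.random.default_rng(0)
y = np.repeat(np.arange(5), 40)               # five machine states, 40 samples each
good = y + 0.1 * rng.standard_normal(200)     # feature that separates the classes
bad = rng.standard_normal(200)                # pure-noise feature
print(discriminant_score(good, y) > discriminant_score(bad, y))  # True
```

A feature with well-separated class means and tight within-class spread receives a high score, matching the intuition behind the F24-first ranking above.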
To visualize the dimensionality-reduced data, the dimension after reduction was set to two. The features are ranked in descending order of WBDA and the largest is selected. The two-dimensional (2D) feature vector selected by the MIVs-WBDA algorithm therefore consists of the feature with the largest WBDA (F24, whose corresponding actual feature is the norm of the three-layer wavelet packet decomposition coefficient cA33) and the feature with the largest MIV (the margin factor). For comparison, Figure 13 shows the 2D feature distributions of the five clusters for the PCA, CDET, MIV, MIVs-SVM, WBDA, and MIVs-WBDA algorithms. Since the five types are nonlinearly separable, it is difficult to judge from the figure alone which feature dimension reduction algorithm works best.
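The two-stage pick of the 2D feature vector reduces to taking the arg-max of each score array; a minimal sketch with hypothetical score values (the real MIV and WBDA arrays come from the network and the cross-validation set):

```python
import numpy as np

rng = np.random.default_rng(1)
miv = rng.random(30)     # hypothetical |MIV| score per candidate feature
wbda = rng.random(30)    # hypothetical WBDA score per candidate feature

f_miv = int(np.argmax(miv))       # primary selection: largest mean impact value
f_wbda = int(np.argmax(wbda))     # secondary selection: largest WBDA
if f_wbda == f_miv:               # diversity strategy: keep the two features distinct
    f_wbda = int(np.argsort(wbda)[-2])
selected = [f_miv, f_wbda]        # indices of the 2D feature vector
print(selected)
```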

Effects of MIVs-WBDA and Network Optimization Algorithm on the Classification Accuracy of the SVM

According to the 2D feature vectors obtained by the different feature extraction methods, samples were classified into five clusters using PSO-optimized, GA-optimized, and conventional SVM classifiers. Table 4 summarizes the classification accuracies of the different dimensionality reduction and optimization algorithms; in the table, SVM refers to the conventional SVM classifier. The table shows that MIVs-WBDA performs better than the other three feature extraction algorithms regardless of whether an optimization algorithm is used. The MIVs-WBDA algorithm exhibits the highest classification accuracy owing to the complementarity of the two parts of the algorithm. For this example, after PSO optimization, the MIV-FE algorithm reaches 90.8% classification accuracy, and the MIVs-WBDA algorithm improves the classification accuracy by about 3%. Because of reflection, interference, diffraction, and multiple interference sources when the noise signal propagates in the air, noise-based diagnosis is susceptible to the environment, so its classification accuracy is lower than that of algorithms based on the vibration signal [4]. Figure 14 shows the executive procedures of the GA and PSO algorithms combined with the MIVs-WBDA method.

Figure 15 shows the confusion matrix of the classification results obtained by the proposed MIVs-WBDA algorithm. As observed, it is difficult to distinguish Normal from Ball when the feature vector dimension is 2, which also explains why the classification accuracy of the experimental results is only about 90%. In fact, the accuracy of noise-based diagnosis is lower than that of diagnosis based on vibration signals; its typical accuracy is below 90%.

Table 5 and Figure 16 illustrate the effects of different feature dimensions on the classification accuracy of the SVM.
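A confusion matrix like the one in Figure 15 is computed directly from the SVM predictions on the test set. A minimal sketch, with five synthetic 2D clusters standing in for the real 2D feature vectors of the five machine states:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Five synthetic 2D clusters standing in for the five machine states
# (normal, inner race, outer race, ball, combined).
X, y = make_blobs(n_samples=600, centers=5, cluster_std=2.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC().fit(X_tr, y_tr)               # RBF-kernel multi-class SVM
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)   # rows: true class, columns: predicted class
```

Off-diagonal counts reveal which pairs of states are confused, which is how the Normal/Ball overlap is read off Figure 15.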
As observed, when the feature vector dimension is greater than 2, the classification accuracy of MIVs-WBDA is the highest, indicating its excellent feature selection performance. In addition, the classification accuracy is a concave function of the feature dimension.
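The accuracy-versus-dimension curves of Table 5 and Figure 16 come from retraining the classifier while growing the number of retained features. A sketch of that sweep on synthetic data (with `shuffle=False`, the informative columns come first, standing in for a feature list already ranked by MIVs-WBDA):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# shuffle=False keeps the informative columns first, standing in for a
# ranked feature list; the real ranking comes from MIVs-WBDA.
X, y = make_classification(n_samples=500, n_features=10, n_informative=8,
                           n_redundant=0, n_classes=5, shuffle=False,
                           random_state=0)

# Retrain with the top-d features for d = 1..10 and record CV accuracy.
acc = [cross_val_score(SVC(), X[:, :d], y, cv=5).mean() for d in range(1, 11)]
print(np.round(acc, 3))
```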

Algorithm Complexity
Algorithm complexity can be characterized by program runtime. Table 6 presents the testing environment, and Table 7 and Figure 17 illustrate the relation between CPU operation time and the feature dimension.
The running efficiency of each algorithm can be analyzed from Table 7 and Figure 17. The typical running time of most algorithms, including MIVs-WBDA, is less than 10 s, which is completely acceptable in practical applications. Compared with the traditional methods, the deep learning method incurs far more time overhead and is therefore more expensive. These experiments highlight the advantages of MIVs-WBDA in terms of both operating efficiency and accuracy.
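Runtimes such as those in Table 7 are obtained by timing the training step with a monotonic clock; a minimal sketch on synthetic data of the same size as the training set:

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in with the same size as the paper's training + CV data.
X, y = make_classification(n_samples=1080, n_features=10, n_informative=6,
                           n_classes=5, random_state=0)

t0 = time.perf_counter()              # monotonic timer suited to benchmarking
SVC().fit(X, y)                       # one SVM training run
elapsed = time.perf_counter() - t0
print(f"training time: {elapsed:.3f} s")
```

Averaging several such runs per feature dimension gives the curve in Figure 17.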

Conclusions and Future Works
Since redundant information in high-dimensional feature vectors may lead to the curse of dimensionality and increased calculation time, this paper proposes an end-to-end feature selection and dimension reduction method (MIVs-WBDA) and compares it with the popular PCA, CDET, MIV, FA, LPP, NPE, and PPCA dimensionality reduction methods. Unlike conventional feature learning algorithms, MIVs-WBDA is a sample feature selection method based on the fusion of network variable selection and WBDA. It accounts for both the correlation between feature selection and the classification network and the correlation between the classification network and feature similarity; hence, MIVs-WBDA can partially overcome the drawbacks of linear classifiers.

The classification effect of noise measurement depends on the environment: different environments may favor different feature selections, and the selection will change when the environment changes. Common feature selection algorithms only map the data and do not consider the influence of the data on the classifier. This paper mainly considers the influence of the features on model classification and integrates model classification and feature selection organically, while the WBDA component comprehensively considers the generalization performance of the algorithm. We evaluated the running time and accuracy of the MIVs-WBDA algorithm against several common feature selection algorithms; the results show that MIVs-WBDA performs well when both time and classification accuracy are considered. The MIVs-WBDA feature extraction algorithm can screen out the several features most conducive to classification, which has high practical application value; it selects the most important features and exhibits enhanced classification performance, realizing the unification of feature representation learning and machine prediction learning.
Experiments show that, when reducing to the same dimension, the MIVs-WBDA method improves the classification accuracy for rotating machinery status by 3% under the two feature set construction methods. The typical running time of this classification learning algorithm is less than 10 s, whereas a deep learning approach would require more than a few hours. It should be noted that when the feature dimension is reduced to 1, the classification accuracy of the MIVs-WBDA algorithm is not high; the best single feature is not selected in this case, and other strategies could be introduced to address the one-dimensional case. In later work, the idea of feature extraction can be combined with the present method to improve classification performance in low dimensions. Of course, in practical applications the feature vector will rarely be one-dimensional, so this limitation does not affect the use of the algorithm. The ideas of constructing a diverse feature pool and of end-to-end feature selection and prediction model learning can also be applied to other similar application scenarios.
Author Contributions: Y.N. proposed part of the algorithm and wrote part of the program. G.W. and Y.Z. modified the algorithm and conducted experimental tests and algorithm simulations. G.W., Y.N., and Y.Z. wrote the paper. Y.Z. and J.Z. revised and edited the manuscript and supplemented some experiments. All authors have read and agreed to the published version of the manuscript.