TF Entropy and RFE Based Diagnosis for Centrifugal Pumps Subject to the Limitation of Failure Samples

: In practical engineering, the vibration-based fault diagnosis with few failure samples is gaining more and more attention from researchers, since it is generally hard to collect su ﬃ cient failure records of centrifugal pumps. In such circumstances, e ﬀ ective feature extraction becomes quite vital, since there may not be enough failure data to train an end-to-end classiﬁer, like the deep neural network (DNN). Among the feature extraction, the entropy combined with signal decomposition algorithms is a powerful choice for fault diagnosis of rotating machinery, where the latter decomposes the non-stationary signal into multiple sequences and the former further measures their nonlinear characteristics. However, the existing entropy generally aims at processing the 1D sequence, which means that it cannot simultaneously extract the fault-related information from both the time and frequency domains. Once the sequence is not strictly stationary (hard to achieve in practices), the useful information will be inevitably lost due to the ignored domain, thus limiting its performance. To solve the above issue, a novel entropy method called time-frequency entropy (TfEn) is proposed to jointly measure the complexity and dynamic changes, by taking into account nonlinear behaviors of sequences from both dimensions of time and frequency, which can still fully extract the intrinsic fault features even if the sequence is not strictly stationary. Successively, in order to eliminate the redundant components and further improve the diagnostic accuracy, recursive feature elimination (RFE) is applied to select the optimal features, which has better interpretability and performance, with the help of the supervised embedding mechanism. To sum up, we propose a novel two-stage method to construct the fault representation for centrifugal pumps, which develops from the TfEn-based feature extraction and RFE-based feature selection. The experimental results using the real vibration data of centrifugal pumps show that, with extremely few failure samples, the proposed method respectively improves the average classiﬁcation accuracy by 12.95% and 33.27%, compared with the mainstream entropy-based methods and the DNN-based ones, which reveals the advantage of our methodology.


Introduction
The centrifugal pump is one of the most critical elements in hydraulic systems [1], which has been widely applied to the modern industry. Due to long-term running in the harsh environment, there is an increasing need to develop and improve the technique of fault diagnosis for centrifugal pumps, to avoid the consequent loss of manpower and economy [2]. As the typical rotating equipment, To solve the above-mentioned issues, a novel fault representation method is designed and proposed in this paper, which is mainly composed of the time-frequency entropy (TfEn)-based feature extraction and recursive feature elimination (RFE)-based feature selection. Compared with the existing mainstream entropy-based methods aimed at processing the 1D sequence, our proposed method can jointly highlight and measure the intrinsic characteristics of nonlinear and non-stationary signals from both dimensions of time and frequency, and it can achieve more accurate fault diagnosis, even with limited failure samples. In the feature extraction stage, TfEn is proposed to comprehensively quantify the complexity and dynamic change by taking into account nonlinear behaviors of signals in the form of a 2D time-frequency matrix. In the feature selection stage, RFE is applied to select the optimal feature subset and eliminate the redundant components. Abandoning the feature fusion of PCA, RFE retains the original physical meanings of selected features, and thus has the better interpretability [21]. In addition, it is worth noting that our proposed method has no rigorous requirements for calculation resources and failure samples, which means that it more suitable for practical scenes with few failure samples. To sum up, the main contributions of this paper can be listed as follows:

1.
A novel feature extraction called TfEn is proposed to jointly quantify the robust nonlinear characteristics from both time and frequency dimensions, which overcomes the limitations of existing entropy based on 1D sequence and thus being more suitable for non-stationary signals; 2.
RFE-based feature selection is applied to select the optimal feature subset form the high-dimensional feature vector, which achieves better performance on improving the classification accuracy and retaining the feature interpretability compared with mainstream methods; 3.
The proposed two-stage fault representation has the capability of highlighting and depicting the intrinsic differences among vibration signals, which presents the stability against the variable sizes of training samples, and obtains the satisfied results even with extremely few failure data; 4.
Compared with the existing entropy variants and mainstream deep models, the experiments with the real vibration signals of centrifugal pumps illustrate the validity and superiority of our proposed fault diagnosis method.
The remaining of the paper is organized as follows. Section 2 introduces the necessary background. Section 3 describes the proposed fault diagnosis. Section 4 implements the experiments, then presents and discusses the results. Finally, the conclusion is shown in Section 5.

Preliminaries
This section presents the necessary background of our proposed method. Complementary ensemble empirical mode decomposition (CEEMD), is applied to process the raw signals of multiple components in this paper. Its principle is introduced as follows.
As an adaptive decomposition algorithm, CEEMD [22] can decompose non-stationary signals into several approximately stationary intrinsic mode functions (IMFs) and non-stationary residual terms. The brief process of CEEMD is described as follows:

1.
Given a raw signal x(t). First, add the particular white noise into the signal x(t) and decompose it. Repeat the decomposition under different noise realizations and average these results as the first sub-signal IMF 1 : where N is the average time, E 1 [] is the defined operator which produces the 1th mode obtained from the given signal using the EMD, σ 1 is the ratio coefficient and ω i is the particular white noise in each time.

CEEMD-Based Signal Processing
The signal processing is the first step of our proposed method. Typically, the vibration signal is composed of multiple narrowband time series. In order to extract the multi-scale intrinsic feature of signals, CEEMD is utilized to adaptively decompose the non-stationary raw signal into several sub-signals.
Among all the adaptive decomposition algorithms, EMD is the one that is earliest proposed and obtains the most applications in fault diagnosis. However, the EMD has the problem called mode mixing [23]. Although, EEMD, a variant of EMD had been proposed to deal with this problem through the ensemble of original signal with Gaussian noise. Considering all the above issues, CEEMD, a more complementary variant of EMD, is applied in this paper to obtain the better separated modes, which combines the improved noise addition and ensemble strategies [22].
As described in Section 2 given a raw vibration signal ( ) x t , it will be adaptively decomposed into S sub-signals , 1,2,..., Taking a vibration signal under the inner-race fault mode as the example, the 10 sub-signals decomposed by CEEMD is shown in Figure 2. Among all the adaptive decomposition algorithms, EMD is the one that is earliest proposed and obtains the most applications in fault diagnosis. However, the EMD has the problem called mode mixing [23]. Although, EEMD, a variant of EMD had been proposed to deal with this problem through the ensemble of original signal with Gaussian noise. Considering all the above issues, CEEMD, a more complementary variant of EMD, is applied in this paper to obtain the better separated modes, which combines the improved noise addition and ensemble strategies [22].
As described in Section 2 given a raw vibration signal x(t), it will be adaptively decomposed into S sub-signals IMF s , s = 1, 2, . . . , S. Taking a vibration signal under the inner-race fault mode as the example, the 10 sub-signals decomposed by CEEMD is shown in Figure 2.
As shown in Figure 2, 10 Imfs are sequentially decomposed in order of the frequency using CEEMD. We can see that the first 5 Imfs are approximately stationary, where their means and variances are approximately independent of time. However, as shown in the red dashed boxes in the figure, the last 5 Imfs present relatively more obvious non-stationary characteristics, namely, the means and variances of the series present large differences at different time instants.
To sum up, multiple narrowband sub-signals are decomposed from the non-stationary vibration signal using CEEMD, which depict the intrinsic characteristic of raw signals from the different scale. However, it should be noted that these components are not strictly stationary, which means that both time and frequency dimensions need to be considered jointly, so as to fully extract effective fault-related information.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 6 of 21 As shown in Figure 2, 10 Imfs are sequentially decomposed in order of the frequency using CEEMD. We can see that the first 5 Imfs are approximately stationary, where their means and variances are approximately independent of time. However, as shown in the red dashed boxes in the figure, the last 5 Imfs present relatively more obvious non-stationary characteristics, namely, the means and variances of the series present large differences at different time instants.
To sum up, multiple narrowband sub-signals are decomposed from the non-stationary vibration signal using CEEMD, which depict the intrinsic characteristic of raw signals from the different scale. However, it should be noted that these components are not strictly stationary, which means that both time and frequency dimensions need to be considered jointly, so as to fully extract effective faultrelated information.

Proposed TfEn-Based Feature Extraction
Entropy is the quantitative tool to describe the complexity and irregularity [24] of systems, which varies with the changing status of systems. For the centrifugal pumps, if the signal is approximately stationary (such as the first 5 Imfs shown in Figure 2), we usually think that the distribution of signals in the frequency domain better reflect the status of objects, where the existing entropy aiming at 1D sequence may be enough to measure their complexity. While, if the signal is close to non-stationary (such as the last 5 Imfs shown in Figure 2), both the time domain and frequency domain contain the useful information reflecting the status of objects. In this case, the existing entropy cannot comprehensively quantify the irregularity of signals in two dimensions, which inevitably bring about the loss of useful information.

Proposed TfEn-Based Feature Extraction
Entropy is the quantitative tool to describe the complexity and irregularity [24] of systems, which varies with the changing status of systems. For the centrifugal pumps, if the signal is approximately stationary (such as the first 5 Imfs shown in Figure 2), we usually think that the distribution of signals in the frequency domain better reflect the status of objects, where the existing entropy aiming at 1D sequence may be enough to measure their complexity. While, if the signal is close to non-stationary (such as the last 5 Imfs shown in Figure 2), both the time domain and frequency domain contain the useful information reflecting the status of objects. In this case, the existing entropy cannot comprehensively quantify the irregularity of signals in two dimensions, which inevitably bring about the loss of useful information.
Aiming at addressing the above-mentioned issues, we propose TfEn to jointly measure the complexity and dynamic change though taking into account the nonlinear behaviors from both dimensions of the time and frequency. The process of TfEn is introduced as follows.

1.
Given a 1D time series x(t). For the purpose of measuring its complexity both in time and frequency domain, x(t) should be first extended into the 2D time-frequency matrix. In this paper, short-time Fourier transform (STFT) [25], a classical time-frequency analysis method, is applied to achieve the above target considering its balance of calculation efficiency and performance.
where w(t) is the window function, t and τ are both the time variables and f is the frequency variable. X(t, f ) can be illustrated as the correlation between x(τ) and w(τ − t)e − j2π f τ .

2.
Further, through the square operation, the X(t, f ) will be transformed into the energy matrix.
where the M is a 2D matrix consisting of X × Y elements e ij , and e ij can be interpreted as the energy located at frequency i and time j.

3.
According to the Shannon methodology, the average degree of uncertainty depicts the irregularity of probability system. In this case, to obtain the robust representation of the time-frequency energy matrix M, we utilize a sliding window of size F × T to divide the M into I × J blocks W i,j , i = 1, . . . , I, j = 1, . . . , J. Then a normalization operator will process these blocks one by one and transform them into a probability density matrix Q. The above operation can be illustrated in Figure 3.
paper, short-time Fourier transform (STFT) [25] , a classical time-frequency analysis method, is applied to achieve the above target considering its balance of calculation efficiency and performance.
where ( ) w t is the window function, t and τ are both the time variables and f is the frequency variable.

( )
, X t f can be illustrated as the correlation between ( ) 2. Further, through the square operation, the ( ) , X t f will be transformed into the energy matrix.
where the M is a 2D matrix consisting of X Y × elements ij e , and ij e can be interpreted as the energy located at frequency i and time j .
3. According to the Shannon methodology, the average degree of uncertainty depicts the irregularity of probability system. In this case, to obtain the robust representation of the timefrequency energy matrix M , we utilize a sliding window of size F T × to divide the M into I J × blocks , , 1,..., , 1,..., Then a normalization operator will process these blocks one by one and transform them into a probability density matrix Q . The above operation can be illustrated in Figure 3.  As shown in Figure 3, the F, T, L f and L t are 4 hyper-parameters of our proposed TfEn, which respectively represent the span of frequency-axis, span of time-axis, translation span of frequency-axis and translation span of time-axis of each time-frequency block.

4.
As shown in Figure 3, through the normalization operators, each time-frequency energy block W i,j will be transformed into a probability density reflecting its irregularity. The calculation of the normalization operators is described as follows: where e x,y is an energy element of the matrix M at coordinate (x, y), w i, j is defined as the sum of energy values in the block W i,j , and A is defined as the total sum of energy elements of all blocks. q i,j is the normalized probability density for the block W i,j , which quantifies the concentration of energy in each local time-frequency block.

5.
According to the normalized probability density of all these blocks, the TfEn feature of the given time series x(t) can be extracted as follows: where c is an arbitrary constant and it is generally set as one, s(q) is the TfEn feature of 1D time series, which jointly represent its complexity and irregularity from both time and frequency domains. 6.
Repeating the above operations for all the S sequences decomposed by CEEMD, we will finally obtain the TfEn feature vector T f Ens of size 1 × S: with the help of T f Ens, we obtain the multi-scale representations reflecting the complexity and dynamic change of nonlinear and non-stationary vibration signals. Besides that, it is worth noting that TfEn requires no state space reconstruction and a large amount of loop comparisons, so it has higher computational efficiency than the methods such as SampleEn and ApEn.

RFE-Based Feature Selection
After the TfEn-based feature extraction, we obtain a feature set T f Ens representing the status of centrifugal pumps. The high-dimensional feature set may contain the sufficient status-related information, while it also has a price: increasing dimensions of the feature set leads to the sparseness of the feature space, which will cause the overfitting problem when the training samples are scare. Considering that there is usually not enough failure data in the practical engineering, the feature selection is a necessary operation to ensure the accuracy of fault diagnosis.
In this subsection, a supervised feature selection method called the recursive feature elimination (RFE) [26] is applied to solve the above-mentioned issues, which can automatically select the optimal feature subset from the high-dimensional feature set. The key steps of RFE-based feature selection are illustrated in Figure 4.  As shown in Figure 4, the RFE-based feature selection is an iterative process, which utilize the criteria developed from the coefficients in a support vector machine (SVM) model to assess features and recursively removes the features with the small criteria.

Given a set of training samples
x is the feature vector sample with S features, i y is the status label of centrifugal pumps. A linear SVM model is trained using these training samples, and its decision function is: 2. According to the trained SVM model, the ranking criterion of features can be calculated as follows: where i α is the Lagrange multiplier, ω is the weight vector of the trained SVM model with S elements s ω , ( ) J s is the ranking criterion for the feature ( ) s s q .
3. Eliminate the feature ( ) k s q with the lowest criterion from the feature set: As shown in Figure 4, the RFE-based feature selection is an iterative process, which utilize the criteria developed from the coefficients in a support vector machine (SVM) model to assess features and recursively removes the features with the small criteria.

1.
Given a set of training samples x i , y i , i = 1, . . . , N, where x i is the feature vector sample with S features, y i is the status label of centrifugal pumps. A linear SVM model is trained using these training samples, and its decision function is: 2.
According to the trained SVM model, the ranking criterion of features can be calculated as follows: where α i is the Lagrange multiplier, ω is the weight vector of the trained SVM model with S elements ω s , J(s) is the ranking criterion for the feature s(q) s .

3.
Eliminate the feature s(q) k with the lowest criterion from the feature set: where T f En 1 is the feature subset with S − 1 features in the first iteration of feature selection. 4.
Repeating the above operations until all the features are eliminated from the feature set, we will obtain the S feature subsets: using the cross-validation technique [27], the feature subset T f Ens * with the best accuracy will be selected as the optimal feature subset for the subsequent fault diagnosis.

Description of Dataset
In this experiment, the real vibration data of the centrifugal pumps is utilized to verify the effectiveness of out proposed fault diagnosis method. The vibration signals are acquired from an installed accelerometer, with a sampling rate of 10.24k Hz, and their engineering unit is millivolt (mV), which is shown in Figure 5.
using the cross-validation technique [27], the feature subset * TfEns with the best accuracy will be selected as the optimal feature subset for the subsequent fault diagnosis.

Description of Dataset
In this experiment, the real vibration data of the centrifugal pumps is utilized to verify the effectiveness of out proposed fault diagnosis method. The vibration signals are acquired from an installed accelerometer, with a sampling rate of 10.24k Hz, and their engineering unit is millivolt (mV), which is shown in Figure 5. There are 5 typical modes of centrifugal pumps in our dataset, which are respectively the innerrace bearing fault, outer-race bearing fault, ball fault, impeller fault and normal status. For each mode of the centrifugal pumps, we divide and obtain 60 samples according to the standard of 5120 points per sample, which means that there are a total of 5 × 60 = 300 samples in our dataset.
In order to test the performance of the proposed method under the scenarios of few failure samples, we divide the raw dataset into the training set and the testing set according to the ratio of 1: 5, which means that there are 10 samples under each fault mode to be used to train the fault classification model. In addition to that, in the Section 4.3, we set up additional several sets of experiments with training samples of smaller sizes, to further compare the adaptability on the extremely few failure samples between the proposed method and other existing mainstream ones.

Fault Diagnosis Using the Proposed Method
In this subsection, using the dataset divided by 10 training samples and 50 testing samples under each mode, we validate the proposed method, and the corresponding results of each key step will be presented and analyzed in the following.
A. CEEMD-based signal processing As the first step of our proposed fault diagnosis method, CEEMD is utilized to adaptively decompose the raw signal into several sub-signals called Imfs. The related hyper-parameters are set as follows: the noise standard deviation (Nstd) is set as 0.  There are 5 typical modes of centrifugal pumps in our dataset, which are respectively the inner-race bearing fault, outer-race bearing fault, ball fault, impeller fault and normal status. For each mode of the centrifugal pumps, we divide and obtain 60 samples according to the standard of 5120 points per sample, which means that there are a total of 5 × 60 = 300 samples in our dataset.
In order to test the performance of the proposed method under the scenarios of few failure samples, we divide the raw dataset into the training set and the testing set according to the ratio of 1: 5, which means that there are 10 samples under each fault mode to be used to train the fault classification model. In addition to that, in the Section 4.3, we set up additional several sets of experiments with training samples of smaller sizes, to further compare the adaptability on the extremely few failure samples between the proposed method and other existing mainstream ones.

Fault Diagnosis Using the Proposed Method
In this subsection, using the dataset divided by 10 training samples and 50 testing samples under each mode, we validate the proposed method, and the corresponding results of each key step will be presented and analyzed in the following.
A. CEEMD-based signal processing As the first step of our proposed fault diagnosis method, CEEMD is utilized to adaptively decompose the raw signal into several sub-signals called Imfs. The related hyper-parameters are set as follows: the noise standard deviation (Nstd) is set as 0.  As shown in Figures 6 and 7, using the CEEMD, each vibration signal is adaptively decomposed into 10 Imfs, which provide the multiple-scale representations for the status of centrifugal pumps. It is worth noting that some decomposed components still exhibit quite obvious non-stationarity characteristics, even when processed by CEEMD, which means that both their time domain and frequency domain need to be comprehensively considered during the subsequent feature extraction.

B. TfEn-based feature extraction
To more fully take advantage of the useful information, in the proposed TfEn, each decomposed Imf is first transformed into the time-frequency matrix, and then each 2D matrix is further divided and calculated to the energy density matrix using the normalization operator. The energy density matrix exhibits the irregularity and dynamic change from both the time and frequency domain, and it can be finally quantified as a TfEn feature.
Specifically, the related hyper-parameters of STFT are set as follows: the window length (window) is set as 512, the cover length (noverlap) is set as 510, the Fourier transform length (nfft) is set as 512 that is equal to the window length. And the hyper-parameters of normalization operator are defined: the span of frequency-axis (F) is set as 64, the span of time-axis (T) is set as 4, the translation span of frequency-axis (Lf) is set as 32 and the translation span of time-axis (Lt) is set as 2.
Repeating the above operations for all 10 Imfs decomposed by CEEMD, we will get the TfEn feature vector of size 10 1 × . Limited to the paper space, we only present the corresponding results of  As shown in Figures 6 and 7, using the CEEMD, each vibration signal is adaptively decomposed into 10 Imfs, which provide the multiple-scale representations for the status of centrifugal pumps. It is worth noting that some decomposed components still exhibit quite obvious non-stationarity characteristics, even when processed by CEEMD, which means that both their time domain and frequency domain need to be comprehensively considered during the subsequent feature extraction.

B. TfEn-based feature extraction
To more fully take advantage of the useful information, in the proposed TfEn, each decomposed Imf is first transformed into the time-frequency matrix, and then each 2D matrix is further divided and calculated to the energy density matrix using the normalization operator. The energy density matrix exhibits the irregularity and dynamic change from both the time and frequency domain, and it can be finally quantified as a TfEn feature.
Specifically, the related hyper-parameters of STFT are set as follows: the window length (window) is set as 512, the cover length (noverlap) is set as 510, the Fourier transform length (nfft) is set as 512 that is equal to the window length. And the hyper-parameters of normalization operator are defined: the span of frequency-axis (F) is set as 64, the span of time-axis (T) is set as 4, the translation span of frequency-axis (Lf) is set as 32 and the translation span of time-axis (Lt) is set as 2.
Repeating the above operations for all 10 Imfs decomposed by CEEMD, we will get the TfEn feature vector of size 10 1 × . Limited to the paper space, we only present the corresponding results of As shown in Figures 6 and 7, using the CEEMD, each vibration signal is adaptively decomposed into 10 Imfs, which provide the multiple-scale representations for the status of centrifugal pumps. It is worth noting that some decomposed components still exhibit quite obvious non-stationarity characteristics, even when processed by CEEMD, which means that both their time domain and frequency domain need to be comprehensively considered during the subsequent feature extraction.

B. TfEn-based feature extraction
To more fully take advantage of the useful information, in the proposed TfEn, each decomposed Imf is first transformed into the time-frequency matrix, and then each 2D matrix is further divided and calculated to the energy density matrix using the normalization operator. The energy density matrix exhibits the irregularity and dynamic change from both the time and frequency domain, and it can be finally quantified as a TfEn feature.
Specifically, the related hyper-parameters of STFT are set as follows: the window length (window) is set as 512, the cover length (noverlap) is set as 510, the Fourier transform length (nfft) is set as 512 that is equal to the window length. And the hyper-parameters of normalization operator are defined: the span of frequency-axis (F) is set as 64, the span of time-axis (T) is set as 4, the translation span of frequency-axis (L f ) is set as 32 and the translation span of time-axis (L t ) is set as 2.
Repeating the above operations for all 10 Imfs decomposed by CEEMD, we will get the TfEn feature vector of size 10 × 1. Limited to the paper space, we only present the corresponding results of the Im f 1 under the modes of normal and outer-race fault, which are shown in Figures 8 and 9.  As shown in Figure 8 and 9, 2D time-frequency matrices exhibit the difference of energy distribution among the Imf1s under the modes of normal and outer-race fault. Through the calculation of normalization operators block by block, these 2D matrices are further extracted and compressed into the energy density matrices of smaller size. The normalization operator is similar to the convolution filters in convolutional neural networks, which condenses each low-level local information into the high-level features. As shown by these energy density matrices, the irregularity of time-frequency distribution under each mode are further highlighted and depicted.
As described in Section 3.3, each density matrix is finally quantified as a TfEn feature, it measures the complexity and dynamic change of each Imf sub-signal from both time and frequency domains. Repeating the above operations for all the other Imfs, we obtain the TfEn feature vector including 10 elements.

C. RFE-based feature selection
For the purpose of further improving the performance of the fault diagnosis, the RFE algorithm is utilized to eliminate the redundant components from the feature set and then select the optimal feature subset. With the help of cross-validation technique, the above process of feature selection is totally automatic.
As described in Section 3.4, we finally obtain the ranking of features by training the linear SVM using the 5 10 50 × = training samples. The ranking of our 10 TfEn features is presented in Table 1.  As shown in Figure 8 and 9, 2D time-frequency matrices exhibit the difference of energy distribution among the Imf1s under the modes of normal and outer-race fault. Through the calculation of normalization operators block by block, these 2D matrices are further extracted and compressed into the energy density matrices of smaller size. The normalization operator is similar to the convolution filters in convolutional neural networks, which condenses each low-level local information into the high-level features. As shown by these energy density matrices, the irregularity of time-frequency distribution under each mode are further highlighted and depicted.
As described in Section 3.3, each density matrix is finally quantified as a TfEn feature, it measures the complexity and dynamic change of each Imf sub-signal from both time and frequency domains. Repeating the above operations for all the other Imfs, we obtain the TfEn feature vector including 10 elements.

C. RFE-based feature selection
For the purpose of further improving the performance of the fault diagnosis, the RFE algorithm is utilized to eliminate the redundant components from the feature set and then select the optimal feature subset. With the help of cross-validation technique, the above process of feature selection is totally automatic.
As described in Section 3.4, we finally obtain the ranking of features by training the linear SVM using the 5 10 50 × = training samples. The ranking of our 10 TfEn features is presented in Table 1. As shown in Figures 8 and 9, 2D time-frequency matrices exhibit the difference of energy distribution among the Imf 1 s under the modes of normal and outer-race fault. Through the calculation of normalization operators block by block, these 2D matrices are further extracted and compressed into the energy density matrices of smaller size. The normalization operator is similar to the convolution filters in convolutional neural networks, which condenses each low-level local information into the high-level features. As shown by these energy density matrices, the irregularity of time-frequency distribution under each mode are further highlighted and depicted.
As described in Section 3.3, each density matrix is finally quantified as a TfEn feature, it measures the complexity and dynamic change of each Imf sub-signal from both time and frequency domains. Repeating the above operations for all the other Imfs, we obtain the TfEn feature vector including 10 elements.

C. RFE-based feature selection
For the purpose of further improving the performance of the fault diagnosis, the RFE algorithm is utilized to eliminate the redundant components from the feature set and then select the optimal feature subset. With the help of cross-validation technique, the above process of feature selection is totally automatic.
As described in Section 3.4, we finally obtain the ranking of features by training the linear SVM using the 5 × 10 = 50 training samples. The ranking of our 10 TfEn features is presented in Table 1.  4 1 s(q) 5 6 s(q) 6 5 s(q) 7 4 s(q) 8 3 s(q) 9 2 s(q) 10 1 As shown in Table 1, a 5-dimensional feature vector including s(q) 1 , s(q) 2 , s(q) 3 , s(q) 4 and s(q) 10 is finally selected as the optimal feature subset T f Ens * . In the optimal feature subset, the first 4 Imfs occupy the dominant, which is consistent with the general experience and intuition of fault diagnosis. While contrary to common sense, the last 3 non-stationary components also obtain quite a high ranking, and the last Imf is also selected as the optimal feature. The above-mentioned phenomenon indicates that it is not reasonable to directly abandon the non-stationary components without analysis, and it also demonstrates that it is necessary to take into account both time and frequency domains during the process of feature extraction.
For the comparisons, another two mainstream feature selection methods, PCA [28] and kernel PCA (KPCA) [29], are also applied to select the optimal TfEn features. In order to understand the performance of these feature selection methods more intuitively, t-distributed stochastic neighbor embedding (t-SNE), a popular technique of dimensionality reduction, is applied to respectively transform the total raw features and the optimal features selected by PCA, KPCA, RFE into the 2D feature space, their corresponding scatter plots with 250 testing samples, are shown in Figure 10.
As shown in Figure 10, we can see that the scatters of features selected by RFE achieve the best resolution under 5 modes. Using these RFE-based optimal features, the difference of TfEn between different fault modes are further highlighted.

D. Fault classification
Considering the performance of our fault representations and the limitation of training samples, a classical classifier, SVM model, is employed to implement the fault classification.
To obtain the SVM model with the best accuracy, we utilize the grid search with cross validation (GSCV) technique [30] to train the SVM models with several hyper-parameters sets. Through the process of the GSCV, the SVM model with the hyper-parameter set C = 1.5, 'kernel = 'rb f , gamma = 'auto is selected as the final classifier to implement the fault classification for centrifugal pumps. In addition to that, the formula for accuracy is defined as Accuracy = Rights/(Rights + Errors) × 100%, and the corresponding diagnostic results are illustrated in Table 2.
As shown in Table 2, defining the size of training samples as 50 and using the SVM as the classifier model, the proposed method based on TfEn and RFE brings about the best results among all 4 methods, where its average accuracy is 100.00%, and no testing sample is misclassified.
Compared with total raw TfEn features, our proposed method improves the average accuracy by 1.20%. Compared with TfEn-PCA and TfEn-KPCA, our proposed method improves the average accuracy by 11.20% and 17.60%. The results preliminarily demonstrate the validity of TfEn, and the superiority of RFE over the mainstream PCA and KPCA-based feature selection methods.

Comparisons with Variable Taining Sample Sizes
In this subsection, two sets of representative methods are applied for comparisons, namely the entropy-based methods and the deep neural network (DNN)-based methods. In addition, in order to further test the adaptability of these methods to extremely few failure samples, a variety of training sample sizes are designed and applied for fault diagnosis • Category I entropy-based methods In this category, still applying CEEMD for the signal processing, 4 existing mainstream entropy-based feature extraction methods are utilized to implement the comparisons, which consist of the singular value decomposition entropy (SvdEn) [31], sample entropy (SampleEn) [32], approximately entropy (ApEn) [31] and spectral entropy (SpecEn) [33].
Additionally, in the feature selection stage, the PCA and KPCA are still utilized to compare with the RFE, which means that there are 3 classes of representations for each feature extraction method applied to achieve the fault diagnosis, namely, features selected by PCA, features selected by KPCA and features selected by RFE. In the fault classification stage, as in Section 4.2, there are still 10 samples under each mode applied to train an SVM classifier model. The corresponding results are shown in Figure 11. As shown in Table 2, defining the size of training samples as 50 and using the SVM as the classifier model, the proposed method based on TfEn and RFE brings about the best results among all 4 methods, where its average accuracy is 100.00%, and no testing sample is misclassified.
Compared with total raw TfEn features, our proposed method improves the average accuracy by 1.20%. Compared with TfEn-PCA and TfEn-KPCA, our proposed method improves the average accuracy by 11.20% and 17.60%. The results preliminarily demonstrate the validity of TfEn, and the superiority of RFE over the mainstream PCA and KPCA-based feature selection methods.

Comparisons with Variable Taining Sample Sizes
In this subsection, two sets of representative methods are applied for comparisons, namely the entropy-based methods and the deep neural network (DNN)-based methods. In addition, in order to further test the adaptability of these methods to extremely few failure samples, a variety of training sample sizes are designed and applied for fault diagnosis
Additionally, in the feature selection stage, the PCA and KPCA are still utilized to compare with the RFE, which means that there are 3 classes of representations for each feature extraction method applied to achieve the fault diagnosis, namely, features selected by PCA, features selected by KPCA and features selected by RFE. In the fault classification stage, as in Section 4.2, there are still 10 samples under each mode applied to train an SVM classifier model. The corresponding results are shown in Figure 11.  As shown in Figure 11, under the condition of 10 training samples per fault mode, the best testing accuracy (100%) is achieved by the proposed method mainly based on TfEn and RFE. For each entropy method, the RFE-based feature selection outperforms another two based on PCA and KPCA, and the advantages are quite obvious, especially for SvdEn and SpectEn, which illustrates the superiority of the RFE-based feature selection. In addition, using the same feature selection based on RFE, the proposed TfEn obtains a better classification accuracy that the other 4 existing entropy methods, where the accuracy is respectively improved by 12.80%, 5.20%, 5.60% and 6.40%. It further proves that our proposed TfEn can extract the more intrinsic fault representations than the other 4 mainstream entropy methods aimed at processing the 1D sequence, which is thus more suitable for the fault classification especially with few failure samples.
In addition to the performance on feature extraction, we compare the calculation efficiency among these entropy methods by the quantified running time, which is illustrated in Table 3. As shown in Table 3, due to the state space reconstruction and lots of loops, SampleEn and ApEn take 2862.31 s and 2740.10 s, respectively, to complete the feature extraction for each 50 samples. Although these two entropy methods obtain a slightly higher classification accuracy than SvdEn and SpectEn, the overly low calculation efficiency still makes them almost intolerable in practices.
To sum up, the proposed TfEn achieves the best balance between the performance and computation efficiency. Although its running time (1.61 s per 50 samples) is slightly longer than SvdEn (0.24 s per 50 samples) and SpectEn (0.14 s per 50 samples), it is still acceptable, taking into account its considerable performance on fault classification.
Besides that, in order to further test the adaptability of methods to limiting few failure samples, the training sample size under each fault mode is respectively set as 4, 6, 8, 10, 12 and 14 in this subsection. In addition to the above 3 deep models, the 5 entropy-based methods in Category I are also applied to the comparisons, which are respectively SvdEn-RFE, SampleEn-RFE, ApEn-RFE, SpecEn-RFE and our proposed TfEn-RFE. The corresponding results are illustrated in Figure 12 and Table 4. According to the comparative results, we can obtain the following conclusions of the proposed TfEn-RFE method.
(1) Stability for variable sizes of training samples As shown in Figure 12 and Table 5 For the four existing entropy methods (SvdEn-RFE, SampleEn-RFE, ApEn-RFE and SpectEn-RFE), their accuracy ranges are 14.11%, 3.79%, 3.44% and 16.49%, respectively. These entropy-based methods achieve the better stability than the DNN-based ones. However, their performance does not achieve significant improvements like the DNN-based ones with increasing training sizes, because they can only extract the features from a single domain (time or frequency), and thus lose useful information of those non-stationary Imfs.
For the proposed TfEn-RFE method, its accuracy range is only 0.74%. To sum up, the proposed TfEn-RFE method always maintains the best classification accuracy, and shows no obvious degradations, even with extremely few training samples, which exhibits the best stability for the variable training samples among all the methods.  According to the comparative results, we can obtain the following conclusions of the proposed TfEn-RFE method.
(1) Stability for variable sizes of training samples As shown in Figure 12 and Table 5, with variable sizes of training samples (4/6/8/10/12/14): For the three DNN methods (FFT-MLP, FFT-SAE and STFT-CNN), the ranges between the maximum accuracy and minimum accuracy are 20.37%, 39.29% and 30.00%, respectively. The DNN-based methods do have extraordinary performance when the training samples are relatively sufficient, with the help of their powerful capability in feature mining. However, their accuracy presents the obvious degradation with the extremely few training samples (size of 4 per fault mode), which results by their worst stability for variable training sizes among all the methods. For the four existing entropy methods (SvdEn-RFE, SampleEn-RFE, ApEn-RFE and SpectEn-RFE), their accuracy ranges are 14.11%, 3.79%, 3.44% and 16.49%, respectively. These entropy-based methods achieve the better stability than the DNN-based ones. However, their performance does not achieve significant improvements like the DNN-based ones with increasing training sizes, because they can only extract the features from a single domain (time or frequency), and thus lose useful information of those non-stationary Imfs.
For the proposed TfEn-RFE method, its accuracy range is only 0.74%. To sum up, the proposed TfEn-RFE method always maintains the best classification accuracy, and shows no obvious degradations, even with extremely few training samples, which exhibits the best stability for the variable training samples among all the methods.
(2) Significant improvement on classification accuracy with extremely few training samples As shown in Figure 12 and Table 5, under the sceneries of extremely few training samples (size of 4 per fault mode), the proposed TfEn improve the average classification accuracy by 33.27% compared with three mainstream DNN-based methods, and the average classification accuracy is improved by 12.95% compared with four mainstream entropy-based methods. Among all the methods, the DNN-based methods obtain the lowest average accuracy. This is because there are many parameters that need to be learned and updated in their complicated model architectures, and the scarce training samples are not enough to support this, and thus limit their performance. Compared to these DNN-based methods, the existing entropy-based methods present the relatively better adaptiveness to the scarce failure samples, while there is still room for improved performance due to the information loss caused by the ignored domain of time or frequency.
To sum up, through jointly taking into account of the nonlinear behaviors from both time and frequency domains, the proposed TfEn-RFE method can extract more intrinsic fault representation and thus achieve the best adaptiveness for the extremely few failure samples. With the help of the proposed method, the classification accuracy is significantly improved compared with the other comparative methods.

Discussion
Considering all the above-mentioned experimental results, the five main conclusions are listed as follows: 1. TfEn can extract more intrinsic fault features of non-stationary series from both time and frequency domains.
As shown in Figures 11 and 12, TfEn achieves the higher classification accuracy than all the other existing entropy methods. Through jointly taking into account nonlinear behaviors from both dimensions of time and frequency, TfEn can extract more intrinsic fault features, even if the decomposed components are non-stationary or approximately stationary (shown in Figures 8 and 9), which avoids the risk of losing useful information caused by existing entropy methods aimed at processing the 1D sequence.
2. TfEn achieves the good balance between the performance and calculation efficiency. Table 3, compared with the existing SampleEn (2862.31 s) and ApEn (2740.10 s), t TfEn presents the significantly shorter running time (1.61 s) but the higher classification, since it does not rely on the state space reconstruction and lots of loop comparisons that need considerable computation resources. Additionally, compared with another two entropy methods, SvdEn (0.24 s) and SpectEn (0.14 s), the running time of TfEn is slightly longer (1.61 s), while it brings about the considerable improvement on classification accuracy with variable training sizes (shown in Table 5). The relatively better balance between performance and calculation efficiency makes TfEn more suitable for practices.

As shown in
3. RFE exhibits superior performance and interpretability than the mainstream PCA and KPCA.
As shown in Figure 10 and Table 2, using the RFE, the difference of TfEn features among each fault mode is further highlighted, and its average classification accuracy is improved by 11.20% and 17.60%, compared with mainstream PCA and KPCA. For the other four entropy methods, the RFE also brings about the better results than the PCA and KPCA-based ones (shown in Figure 11 and Table 5), which reveals its superiority on the feature selection. Besides that, the optimal features selected by the RFE retain their original physical meanings and thus having the better interpretability (shown in Table 1). 4. Effective feature extraction designed manually is still necessary under the scenery of few failure samples.
As described in Section 4.3, various sizes of training samples (4/6/8/10/12/14) are defined to test the stability and adaptiveness for methods. According to Figure 12 and Table 5, we can see that, with the help of strong capability on the feature self-learning, the DNN-based methods do achieve superior performance (accuracy of 100%) when training sizes are relatively large (size of 12/14 per fault mode). However, their classification accuracy is overtaken by the entropy-based methods designed manually when the training samples are gradually decreased, which means that the effective manual feature extraction is still necessary for the practices where the failure data is usually hard to collect. 5. The proposed method provides a novel concept for fault diagnosis subject to the limitation of few failure samples.
With the help of TfEn and RFE, robust fault representations are constructed for centrifugal pumps, which presents the superior stability and adaptiveness for the variable sizes of training samples. When the training samples are extremely scarce (size of four per fault mode), the average classification accuracy is significantly improved by 12.95% and 33.27%, compared with the mainstream entropy-based methods and the DNN-based ones. Thanks to the remarkable fault representation, the proposed method has no requirements for massive data, and is thus more suitable for practical engineering.

Conclusions
In this paper, we proposed a TfEn and RFE based diagnosis method for centrifugal pumps, subject to the limitation of failure samples. Through jointly taking into account the complexity and dynamic change from both the domains of time and frequency, TfEn can highlight and extract more intrinsic fault features of non-stationary vibration signals for centrifugal pumps compared with the mainstream entropy methods. Using the RFE-based feature selection, the optimal fault features are automatically selected, and the diagnostic performance is further improved, and it should be noted that the selected features still retain their original physical meanings, and thus have better interpretability than the PCA. The experiments show that our proposed two-stage fault representation method presents stability against various training sizes, and superior performance with extremely few failure samples, which exhibits remarkable potential for further applications in various aspects of practical engineering.
In the future, we will further study the fault diagnosis method, considering the fusion of multi-type sensor signals, which may be more suitable for real industrial scenarios.