An Intelligent Fault Diagnosis Approach Considering the Elimination of the Weight Matrix Multi-Correlation

Zenghui An 1,*, Shunming Li 1, Jinrui Wang 1, Weiwei Qian 1 and Qijun Wu 2
1 College of Energy and Power Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210000, China; smli@nuaa.edu.cn (S.L.); wangjr33@163.com (J.W.); qianweiwei33@163.com (W.Q.)
2 China Ship Development and Design Center, Wuhan 430064, China; cuanxinjia@163.com
* Correspondence: me_anzh@163.com; Tel.: +86-188-5117-8176


Introduction
With the arrival of modern manufacturing systems, machines have become more automatic and efficient, which has led to increased demands on their reliability, quality, and availability [1,2]. As a result, machinery fault diagnosis systems, which focus on detecting health conditions after certain faults occur, have attracted considerable attention [3,4]. Additionally, with recent developments in both industry and the Internet, data acquisition has increased exponentially. Thus, fault diagnosis has entered the era of Big Data [5–7]. Because mechanical big data is typically characterized as large-volume, diverse, and high-velocity [8], methods of extracting features rapidly and accurately from mechanical big data have become an urgent subject of research [9,10]. Existing fault diagnosis methods can be divided into two major categories [11]: physics-based models and data-driven ones [12,13]. Physics-based models rely heavily on high-quality domain knowledge and incur massive computation costs, which reduces the overall efficiency of fault diagnosis. Thus, they are unsuitable for big data [14–16]. Data-driven models [15], such as Artificial Neural Networks (ANN) [17,18], Autoencoders [5], Restricted Boltzmann Machines (RBM) [19], Convolutional Neural Networks (CNN) [20,21], and k-Nearest Neighbor [22], depend less on human knowledge and can effectively diagnose faults in mechanical big data. However, these intelligent fault diagnosis methods pose specific challenges: their various hyperparameters are difficult to adjust, and good diagnosis accuracy is obtained only when the hyperparameters are set properly. For example, an autoencoder has four hyperparameters to tune, while an RBM has six.
Ngiam et al. [23] proposed sparse filtering, an unsupervised two-layer network that optimizes the sparsity distribution of the features calculated from the collected data instead of modeling the distribution of the data. It also scales well with the dimension of the input [24]. Only the number of features needs to be set; as a result, it is extremely simple to tune and easy to implement. Thanks to this strong performance, sparse filtering has been successfully adopted in several image recognition cases [25–27]. Recently, sparse filtering has also been introduced into the field of rotary machinery fault diagnosis, delivering excellent performance in feature extraction from complex fault signals. Lei et al. [28] first made a constructive attempt to apply sparse filtering to fault diagnosis through a two-stage learning method. As a result, sparse filtering was shown to be an excellent tool for feature extraction. Subsequently, a physical interpretation of sparse filtering was given, in which the trained weight vectors could be compared directly to Gabor filters. Zhao et al. [29] used sparse filtering to extract multi-domain sparse features and adopted it for fault identification in a planetary gearbox. Jiang et al. [30] proposed a multiscale representation learning (MSRL) framework based on sparse filtering to learn useful features directly from raw vibration signals; the aim was to capture rich and complementary fault pattern information at different scales. Yang et al. [31] combined sparse filtering with a Support Vector Machine optimized by an improved particle swarm algorithm to simplify the hyperparameters.
However, as shown in Lei et al. [28], when the important parameter of input dimension increased, an initial increase in testing accuracy was followed by a marked decrease (in this paper, this behavior is called non-monotonicity). This indicates that considerable time is required to optimize the input dimension in order to increase diagnosis accuracy. To avoid this unnecessary work, the cause of non-monotonicity should be clearly explained. Therefore, the interpretation of the input dimension should be studied first. Next, the nature of non-monotonicity will be explained and, finally, a method to improve the performance of sparse filtering for fault diagnosis will be proposed.
To address these problems, the remainder of this paper is organized as follows. Section 2 introduces the algorithm of sparse filtering. Section 3 studies the interpretation of the input dimension and explains the nature of the non-monotonicity phenomenon. Section 4 details the proposed method, which is based on the elimination of the multi-correlation of the weight matrix. In Section 5, diagnosis cases on bearing and gear datasets are used to validate the effectiveness of the proposed method. Finally, conclusions are drawn in Section 6.

Sparse Filtering
Sparse filtering is an unsupervised feature learning method that attempts to make the learned features satisfy three principles: population sparsity, lifetime sparsity, and high dispersal [23]. To realize these properties, sparse filtering trains a weight matrix through the optimization of a cost function.
As shown in Figure 1, the collected raw vibration signal is used as input data. First, the vibration signal is separated into $M$ samples to compose the training set $\{x^i\}_{i=1}^{M}$, where $x^i \in \mathbb{R}^{N\times 1}$ is a training sample containing $N$ data points. Then, the training set is used to train the sparse filtering model to obtain a weight matrix $W \in \mathbb{R}^{N\times L}$. Finally, each sample is mapped into a feature vector $f^i \in \mathbb{R}^{L\times 1}$ by the weight matrix. For sparse filtering, an activation function is needed to calculate nonlinear features. We first consider the case in which sparse filtering computes linear features for each sample:

$$f_j^i = W_j^T x^i \tag{1}$$

where $f_j^i$ is the $j$th feature value of the $i$th sample and $W_j$ is the $j$th column of $W$.

The feature matrix comprises the features $f_j^i$. First, each row $f_j$ is normalized by its $\ell_2$-norm so that all features are equally activated:

$$\tilde{f}_j = \frac{f_j}{\|f_j\|_2} \tag{2}$$

Then, each column (i.e., each sample) is normalized by its $\ell_2$-norm. As a result, each feature vector is constrained to lie on the unit $\ell_2$-ball:

$$\hat{f}^i = \frac{\tilde{f}^i}{\|\tilde{f}^i\|_2} \tag{3}$$

Finally, the normalized features are optimized for sparseness using the $\ell_1$ penalty. For the training set $\{x^i\}_{i=1}^{M}$, the sparse filtering objective is:

$$\min_{W} \sum_{i=1}^{M} \|\hat{f}^i\|_1 \tag{4}$$

Generally, the activation function $g(\cdot) = |\cdot|$ is recommended, and the features can be rewritten in inner product form:

$$f_j^i = g\left(W_j^T x^i\right) = \left|\langle W_j, x^i \rangle\right| \tag{5}$$

This suggests that sparse filtering can be interpreted as a measurement of the similarity between the input signals and a series of weight vectors, similar to the wavelet transform [32].
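As a rough illustration of how the normalization steps and the $\ell_1$ objective fit together, the following NumPy sketch evaluates the sparse filtering cost for a given weight matrix. The function name `sparse_filtering_objective`, the small constant `eps` (added only to avoid division by zero), and the random test data are assumptions made for illustration; no optimizer is included.

```python
import numpy as np

def sparse_filtering_objective(W, X, eps=1e-8):
    """Evaluate the sparse filtering cost.

    W : (N, L) weight matrix, one weight vector per column.
    X : (N, M) training set, one sample per column.
    Returns the scalar cost and the normalized (L, M) feature matrix.
    """
    # Features with the absolute-value activation g(.) = |.|
    F = np.abs(W.T @ X)                                  # shape (L, M)
    # Normalize each row (feature) across all samples by its l2-norm
    F_row = F / (np.linalg.norm(F, axis=1, keepdims=True) + eps)
    # Normalize each column (sample) so it lies on the unit l2-ball
    F_hat = F_row / (np.linalg.norm(F_row, axis=0, keepdims=True) + eps)
    # l1 penalty summed over all samples (entries are non-negative)
    return F_hat.sum(), F_hat

# Hypothetical sizes: N = 20 points per sample, L = 5 features, M = 30 samples
rng = np.random.default_rng(0)
W = rng.standard_normal((20, 5))
X = rng.standard_normal((20, 30))
cost, F_hat = sparse_filtering_objective(W, X)
```

Because every column of `F_hat` has unit $\ell_2$-norm, the cost can only decrease by concentrating each sample's activation in a few features, which is how the objective promotes sparsity.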

Nature of Input Dimension and Overfitting
In this section, the nature of non-monotonicity is studied. First, several fundamental laws are revealed using a series of harmonic signals. Then, bearing fault signals are used to further verify our interpretation. Finally, on the basis of this physical interpretation, the overfitting phenomenon is identified and its nature is investigated.

Characteristics of Harmonic Signals
A harmonic signal $y(t)$ is defined as follows:

$$y(t) = A\sin(2\pi f_r t + \theta) \tag{6}$$

where $A$ and $\theta$ are the amplitude and phase of $y(t)$, respectively, and $f_r$ denotes the rotational frequency.
The sampling rate $f_s$ of the signals is 10,000 Hz and the sampling time $t$ is 12 s. For each experiment in this section, 30 trials were carried out to reduce the effect of randomness. In addition, 10% of the samples were randomly selected for training, and the output dimension was set to 1.

Consider Different Initial Phases
Two groups of harmonic signals (100 Hz and 130 Hz) with different initial phases were processed by sparse filtering, where $A = 1$ and $\theta$ = 0°, 45°, 90°, 135°, 180°, 225°, 270°. The classification results for different input dimensions are displayed in Figure 2. When the input dimension $N_{in}$ = 50, 100, 150, 200 and $f_r$ = 100 Hz, the testing accuracies reached 100%. However, the majority of the testing accuracies were quite low. This suggests that sparse filtering was unable to recognize the initial phase information.

Consider Different Amplitudes
Without loss of generality, a batch of harmonic signals with different amplitudes was processed by sparse filtering, where $A$ = 1, 1.2, 1.4, 1.6, 1.8, 2, 2.2; $\theta$ = 0°; $f_r$ = 100 Hz. Accuracies of 100% were obtained for all input dimensions. Figure 3 shows the relationship between the amplitudes of the samples and the learned features for different input dimensions. It indicates that the learned features are proportional to the amplitudes of the samples. The scales are irregular and do not depend on the input dimension. This means that sparse filtering can classify different types by their amplitude information, but the input dimension does not affect the learned features. Thus, the amplitude was set to one in the follow-up study.
Figure 3. Relationship between the amplitudes of samples and the learned features.

Consider Different Rotational Frequencies
A set of rotational frequencies was used to assess the frequency-distinguishing ability of sparse filtering, where $A = 1$; $\theta$ = 0°; $f_r$ = 100, 150, 200, 250, 300, 350, 400 Hz. Figure 4 shows the diagnosis accuracies using various input dimensions. The testing accuracy of all input dimensions achieved 100%. This suggests that the learned features of sparse filtering can reflect the frequency information of vibration signals. However, when the input dimension increased, the averaged testing accuracy was higher.

To illustrate this phenomenon, we selected weight matrices of two different input dimensions, i.e., 100 and 200, for comparison. The learned features and the spectra of the weight matrices are shown in Figure 5. Figure 5a shows that the frequency interval in the spectra of the weight matrices was equal to that of the samples, and the amplitudes of the spectra were proportional to the features at the corresponding frequency. This resulted in steady, clear, and distinct learned features for the various samples. However, the frequency interval of the weight matrices in Figure 5b was 100 Hz, twice the frequency interval of the samples. The features of the samples whose frequencies were 150, 250, and 350 Hz therefore depended on the amplitudes of adjacent frequencies in the spectra of the weight matrix. This resulted in unclear learned features, as shown in Figure 5b. Therefore, the frequency resolution of the weight matrix depends on the input dimension.

From the inspection of the properties of a series of harmonic signals, several conclusions can be summarized:
(1) Sparse filtering is unable to classify the initial phase information.
(2) Sparse filtering can recognize the amplitude information, but the input dimension does not affect the learned features of sparse filtering.
(3) The learned features of sparse filtering can reflect the frequency information. Additionally, the frequency resolution of the weight matrix depends on the input dimension. The features are unstable when the input dimension is reduced.
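The dependence of frequency resolution on input dimension in conclusion (3) can be checked with a short calculation. The sketch below assumes that the spectrum of a length-$N_{in}$ weight vector is taken with a plain discrete Fourier transform, so that the frequency interval is $f_s / N_{in}$; the helper names `bin_spacing` and `on_bin` are introduced here only for illustration.

```python
import numpy as np

fs = 10_000                                    # sampling rate of the harmonic signals (Hz)
test_freqs = np.array([100, 150, 200, 250, 300, 350, 400])  # f_r values (Hz)

def bin_spacing(n_in, fs):
    """Frequency interval of the DFT of a length-n_in vector."""
    return fs / n_in

def on_bin(freqs, n_in, fs):
    """Flag which frequencies coincide with a DFT bin for this length."""
    bins = np.fft.rfftfreq(n_in, d=1.0 / fs)
    return np.array([bool(np.any(np.isclose(bins, f))) for f in freqs])

for n_in in (100, 200):
    print(n_in, bin_spacing(n_in, fs), on_bin(test_freqs, n_in, fs))
```

With $N_{in} = 100$ the interval is 100 Hz, so 150, 250, and 350 Hz fall between bins and their energy leaks into neighbouring frequencies; with $N_{in} = 200$ the interval is 50 Hz and every test frequency lies on a bin.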

Data Description
In this section, the bearing dataset [33] provided by Case Western Reserve University was employed for analysis. The vibration signals were collected using accelerometers at the drive end of a motor under four different health conditions: normal condition (NC), inner race fault (IF), outer race fault (OF), and roller fault (RF). There were three different severity levels (0.18, 0.36, and 0.53 mm) for each of the IF, OF, and RF cases. All the samples were collected under four different loads (0, 1, 2, and 3 hp), and the sampling frequency was 12 kHz. Therefore, the dataset included ten health states under four loads, and we treated the same health state under different loads as one class.

The Influence of Input Dimension of Sparse Filtering
The method of Lei et al. [28] was adopted to process the experimental signals described above. The selection of the input dimension $N_{in}$ of sparse filtering was investigated. Softmax regression was adopted as the classifier, and the diagnosis results are shown in Figure 6. It can be seen that, as the input dimension increased, the testing accuracy first increased and then decreased. Therefore, excessive human labor was required to select an input dimension that gives high testing accuracy. To overcome this deficiency, we sought a clear explanation of the nature of the input dimension.
The bearing signals were employed to verify the above explanation of the input dimension, which was based on harmonic signals. As shown in Figure 7a, two weight vectors with $N_{in}$ = 50 and 100 were selected, and their spectra are shown in Figure 7b. The weight vectors were strikingly similar to the one-dimensional Gabor filter, which serves as an excellent band-pass filter for signals. The Gabor function is as follows [34]:

$$G(x) = A\exp\left(-\frac{(x-D)^2}{2\sigma^2}\right)\cos\left(2\pi\omega(x-D)+\phi\right) + B \tag{7}$$

where $A$, $\omega$, and $\phi$ are the amplitude, spatial frequency, and phase of the cosine term, respectively, $\sigma$ is the standard deviation of the Gaussian, $D$ denotes a position offset, and $B$ is an offset parameter.

The two weight vectors exhibit the same bandwidth, which means that the features extracted by them theoretically carry the same frequency information. However, when the input dimension diminishes, the frequency resolution of the weight matrix also decreases. As a result, the energy at each frequency is dispersed, leading to unclear learned features and hence lower testing accuracy.


The Nature of the Overfitting Phenomenon
As seen in the above discussion, the input dimension affects testing accuracy through its influence on the frequency resolution of the weight matrix. Accordingly, the testing accuracy would be expected to increase with the input dimension. However, as seen in Figure 6, when the input dimension was larger than 100, the testing accuracy decreased. The training accuracy did not decrease when the input dimension was increased, suggesting that sparse filtering cannot extract discriminative features for the testing dataset even though the weight matrix has good frequency resolution. This phenomenon is called overfitting.
To explain the nature of overfitting, we use $WW^T$ to measure the similarity of the row vectors when $N_{in}$ = 100 and $N_{in}$ = 200, respectively. The results are shown in Figure 8a,b. When the input dimension is 100, the inner product of each row vector with itself approaches 1, and the inner product between two different vectors is close to 0. However, when the input dimension is 200, the inner products between vectors of the weight matrix indicate that they have similar patterns. This suggests that sparse filtering encourages the learned weight matrix to display clear and distinct patterns when the input dimension is smaller. Fifteen row vectors of $W$, trained by sparse filtering with $N_{in}$ = 100, were randomly selected and plotted in Figure 9a. The corresponding frequency spectra of these row vectors are displayed in Figure 9b. The row vectors are localized in the time domain and occupy a narrow spectral band in the frequency domain. Accordingly, these row vectors exhibit time-frequency properties and show similarities to one-dimensional Gabor functions, which serve as good band-pass bases for mechanical signals. For comparison, 15 randomly selected row vectors of $W$ trained by sparse filtering with $N_{in}$ = 200 are plotted in Figure 9c,d. These vectors also exhibited some time-frequency properties, but with a wide spectral bandwidth in the frequency domain. The results demonstrate that the clearer and more distinct the time-frequency properties of the trained weight matrix are, the better the diagnosis performance of the method. Therefore, sparse filtering should be encouraged to learn discriminative features in order to improve the testing accuracy.
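The $WW^T$ similarity check described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: it assumes the weight vectors are stored as rows and scaled to unit length before the product is taken, and the two toy matrices (`W_distinct`, `W_similar`) are synthetic stand-ins for trained weights.

```python
import numpy as np

def row_gram(W_rows, eps=1e-12):
    """Gram matrix W W^T after scaling every row to unit l2-norm."""
    Wn = W_rows / (np.linalg.norm(W_rows, axis=1, keepdims=True) + eps)
    return Wn @ Wn.T

def mean_offdiag_abs(G):
    """Average absolute inner product between different row vectors."""
    mask = ~np.eye(G.shape[0], dtype=bool)
    return np.abs(G[mask]).mean()

rng = np.random.default_rng(0)

# Toy case 1: mutually orthogonal rows (clear, distinct patterns)
W_distinct = np.linalg.qr(rng.standard_normal((8, 8)))[0]

# Toy case 2: rows that are noisy copies of one pattern (multi-correlated)
base = rng.standard_normal(50)
W_similar = base + 0.01 * rng.standard_normal((8, 50))

G_distinct = row_gram(W_distinct)
G_similar = row_gram(W_similar)
print(mean_offdiag_abs(G_distinct), mean_offdiag_abs(G_similar))
```

A Gram matrix close to the identity corresponds to the situation reported for $N_{in}$ = 100, while large off-diagonal entries correspond to the similar patterns reported for $N_{in}$ = 200.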

Modified Sparse Filtering and Two-Stage Learning Method
In this section, a novel method known as modified sparse filtering is proposed, which aims to resolve the problem of overfitting. From the above discussion, the nature of the overfitting phenomenon is that different vectors of the weight matrix exhibit similar patterns. To suppress the similarity of the row vectors, a constraint term is added to the cost function:

$$\min_{W} \sum_{i=1}^{M} \|\hat{f}^i\|_1 + \lambda \sum_{i \neq j} \omega_i \omega_j^T \tag{8}$$

where $\lambda$ is the tuning parameter and $\omega_i$, $\omega_j$ are row vectors of the weight matrix. The constraint term represents the sum of the inner products among different row vectors.

A two-stage learning method is proposed for intelligent fault diagnosis of machines based on the modified sparse filtering. The illustration and flowchart of the method are shown in Figure 10. In the first learning stage, modified sparse filtering is used to extract local discriminative features from raw vibration signals, and the learned features of the signals are obtained by averaging these local features. In the second stage, softmax regression is applied to classify mechanical health conditions using the learned features.
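To make the penalized cost concrete, the sketch below evaluates the sparse filtering objective plus $\lambda$ times the sum of inner products between different weight vectors. Several points are assumptions for illustration: the weight vectors are stored as rows of `W_rows`, the penalty is taken literally as a plain sum of inner products (the text does not say whether an absolute value or square is applied), and the function name `modified_cost` and the constant `eps` are hypothetical.

```python
import numpy as np

def modified_cost(W_rows, X, lam=1.0, eps=1e-8):
    """Sparse filtering cost plus a penalty on inner products between rows.

    W_rows : (L, N_in) weight matrix, one weight vector per row (assumed layout).
    X      : (N_in, M) training segments, one per column.
    Returns (total cost, sparsity term, correlation penalty).
    """
    F = np.abs(W_rows @ X)                                    # (L, M) features
    F = F / (np.linalg.norm(F, axis=1, keepdims=True) + eps)  # row normalization
    F = F / (np.linalg.norm(F, axis=0, keepdims=True) + eps)  # column normalization
    sparsity = F.sum()                                        # l1 term
    G = W_rows @ W_rows.T                                     # all pairwise inner products
    penalty = G.sum() - np.trace(G)                           # keep only i != j terms
    return sparsity + lam * penalty, sparsity, penalty

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 20))
W_identical = np.full((3, 4), 0.5)   # three identical unit-norm rows
total, sparsity, penalty = modified_cost(W_identical, X, lam=1.0)
```

For the three identical unit-norm rows in this toy case, every off-diagonal inner product equals 1, so the penalty reaches its largest value for rows of that norm; mutually orthogonal rows would give a penalty of zero.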

1.
Collect signals.The vibration signals of machines are obtained under different health conditions.
These signals compose the training set x i , y i M i=1 , where x i ∈ N×1 is the ith sample containing M vibration data points and y i is the health condition label.We collect N s segments from each sample to compose the training set s j N s j=1 by an overlapped manner, where s j ∈ N in ×1 is the jth segment containing N in data points.The set s j N s j=1 is rewritten as a matrix form S j ∈ N in ×N s .

2.
Whitening.It is necessary to pre-process S by whitening.Whitening uses the eigenvalue decomposition of the covariance matrix where E denotes the orthogonal matrix of eigenvectors of the covariance matrix cov(S T ), and D is the diagonal matrix of the eigenvalues.The whitened training set S w can then be obtained as follows:

3.
Train sparse filtering.S w is employed to train the modified sparse filtering model; as a result, the weight matrix W is obtained by minimizing Equation ( 8).

4. Calculate the local features. The training sample $x^i$ is divided, without overlap, into $K$ segments, where $K = N/N_{in}$. These segments constitute a set $\{x_k^i\}_{k=1}^{K}$, where $x_k^i \in \mathbb{R}^{N_{in} \times 1}$. The local features $f_k^i \in \mathbb{R}^{1 \times N_{out}}$ are calculated from each segment $x_k^i$ by the weight matrix $W$.

5. Obtain the learned features. The local features $f_k^i$ are combined into a feature vector $f^i$ by averaging, and $f^i$ is the learned feature vector:

$$f^i = \frac{1}{K} \sum_{k=1}^{K} f_k^i.$$

6. Train softmax regression. Once the learned feature set is obtained, we combine it with the label set to train softmax regression.
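The data-handling steps above (segment collection, whitening, local feature calculation, and averaging) can be sketched in NumPy as follows. This is a sketch under stated assumptions rather than the authors' code: the helper names are hypothetical, segment start positions are drawn at random, the whitening is the ZCA form with a small `eps` added for numerical stability, and the local features use the soft-absolute activation of standard sparse filtering. Training of $W$ and of the softmax classifier is omitted.

```python
import numpy as np

def collect_segments(x, n_in, n_s, rng):
    """Step 1: draw n_s (possibly overlapping) segments of length n_in from signal x."""
    starts = rng.integers(0, len(x) - n_in + 1, size=n_s)
    return np.stack([x[s:s + n_in] for s in starts], axis=1)   # shape N_in x N_s

def whiten(S, eps=1e-8):
    """Step 2: whiten S (N_in x N_s) via the eigendecomposition of its covariance."""
    S = S - S.mean(axis=1, keepdims=True)          # center each dimension
    cov = S @ S.T / S.shape[1]                     # covariance matrix, N_in x N_in
    d, E = np.linalg.eigh(cov)                     # eigenvalues d, eigenvectors E
    return E @ np.diag(1.0 / np.sqrt(d + eps)) @ E.T @ S

def learned_feature(x, W, eps=1e-8):
    """Steps 4-5: split x into K = N // N_in segments and average the local features."""
    n_in = W.shape[1]
    K = len(x) // n_in
    segments = x[:K * n_in].reshape(K, n_in).T     # N_in x K, one segment per column
    local = np.sqrt((W @ segments) ** 2 + eps)     # N_out x K local features (assumed activation)
    return local.mean(axis=1)                      # learned feature vector, length N_out
```

In use, `whiten(collect_segments(...))` would feed the training of $W$, and `learned_feature` would then be applied to every sample before softmax regression.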

Case Study 1: Fault Diagnosis of Motor Bearing
The bearing dataset provided by Case Western Reserve University [25] is analyzed in this section. The dataset is detailed in Section 3.2.1. It included ten health classes under four loads, and we treated the same health state under different loads as one class. Additionally, 15 trials were carried out for each experiment to reduce the effect of randomness.
First, we investigated the selection of the input dimension $N_{in}$ of modified sparse filtering. Ten percent of the samples were randomly selected to train the proposed model. The remaining samples were equally divided into a testing dataset (used to adjust the parameters) and a validation dataset. The weight decay term of softmax regression was $1 \times 10^{-5}$ and $\lambda = 1$. The output dimension $N_{out}$ was half of the input dimension. The diagnosis results for different values of $N_{in}$, in comparison with the original method, are displayed in Figure 11. As seen in the figure, all testing accuracies of the proposed method were over 98.1%. Although the non-monotonicity phenomenon was still present, the testing accuracies of the proposed method were higher than those of the original method. This suggests that the difficulty of parameter selection was alleviated.
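The data split described above can be sketched as follows. The function names and the seeding scheme are illustrative assumptions; the text only specifies a random 10% training share, an equal split of the remainder into testing and validation datasets, and 15 repeated trials.

```python
import numpy as np

def split_indices(n_samples, train_frac=0.10, seed=0):
    """Randomly pick train_frac of the samples for training and halve the remainder."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_train = int(round(train_frac * n_samples))
    train, rest = perm[:n_train], perm[n_train:]
    half = len(rest) // 2
    return train, rest[:half], rest[half:]         # training, testing, validation indices

def summarize_trials(accuracies):
    """Mean and standard deviation of the accuracies over repeated trials."""
    acc = np.asarray(accuracies, dtype=float)
    return acc.mean(), acc.std()
```

Repeating the split with 15 different seeds and passing the resulting accuracies to `summarize_trials` gives the kind of mean and standard deviation reported in this section.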

Overfitting typically arises when the training dataset is small. Therefore, we investigated the effect of the percentage of samples used for training. The diagnosis accuracies are shown in Figure 12. The testing accuracies of the proposed method are higher than those of the original method in every condition. Furthermore, the proposed method obtained 95.3% accuracy, with a small standard deviation of 1.03%, using only 1% of the samples for training. These results indicate that the proposed method performs well and can overcome the overfitting problem.
We selected the parameters $N_{in} = 100$, $N_{out} = 100$, and $\lambda = 1$; the weight decay term of softmax regression was $1 \times 10^{-5}$. The average accuracy on the validation dataset was 99.85%, higher than the 99.66% obtained by the original method. To compare the features, t-SNE [35] was used. This technique embeds the 100-D feature vectors in a 3D map in such a way that vectors in close proximity to each other in the 100-D space are also in close proximity in the 3D plot [36]. The results for the validation dataset processed by t-SNE are shown in Figure 13. The mapped features of different types are clearly separated, and the features of the same type are gathered together; the distance between types is large enough to distinguish the different health conditions.
As shown in Section 3, overfitting produces unclear and similar patterns in different row vectors of the weight matrix. Figure 14 shows the inner products of the row vectors of the weight matrix ($WW^T$) obtained by the proposed method when the input dimension is equal to 200. Fifteen row vectors of $W$ were randomly selected and plotted in Figure 15a, and their frequency spectra are displayed in Figure 15b; notably, most of the row vectors are approximately orthogonal. Additionally, the bandwidths of the row vectors are narrow in the frequency domain, as shown in Figure 15b. This suggests that modified sparse filtering can extract more discriminative features with less redundancy. As a result, the accuracies obtained by the proposed method are higher because the learned features are constrained to be more meaningful and dissimilar.
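The two checks discussed above, near-orthogonality of the row vectors and their frequency content, can be sketched in a few lines of NumPy. The function names are hypothetical, and the normalization of the rows before taking inner products is an assumption made here so that the result is a cosine similarity in [0, 1].

```python
import numpy as np

def max_offdiag_cosine(W):
    """Largest absolute cosine similarity between two different row vectors of W."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)   # unit-length rows
    G = Wn @ Wn.T                                       # normalized W W^T
    np.fill_diagonal(G, 0.0)                            # ignore self-similarity
    return np.abs(G).max()

def row_spectra(W):
    """Magnitude spectrum of every row vector (one-sided FFT along the row)."""
    return np.abs(np.fft.rfft(W, axis=1))
```

A value of `max_offdiag_cosine` close to zero corresponds to the nearly diagonal $WW^T$ pattern, and a spectrum concentrated in a few bins corresponds to the narrow bandwidth described in the text.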

Case Study 2: Fault Diagnosis of Gearbox
The common faults of gears include local faults (pitting and broken teeth), distributed faults such as wear, and multiple coupled faults. Accurate fault identification is necessary for the safety of a mechanical system. In this section, a gearbox experimental dataset acquired under different speeds was employed to validate the robustness of the proposed method. The test data were gathered on the gearbox platform shown in Figure 16, which consisted of a gearbox, a diesel engine, a bearing seat, a flexible coupling, a base, etc. The speed of the test system was controlled by electrical machinery. The gearbox contained two gears (pinion and wheel gear); their parameters are shown in Table 1. When the diesel engine was running, the signal-to-noise ratio was low. Therefore, the dataset was used to validate the robustness of the proposed algorithm to noise. The gear dataset contained four kinds of faults: a coupled fault of wheel pitting and pinion wear, single wheel pitting, a coupled fault of a broken wheel tooth and pinion wear, and single pinion wear.

For convenience, the above four types of gear faults are named Type-2, Type-3, Type-4 and Type-5, respectively. In addition, the normal state of the gear is referred to as Type-1. Each fault type was tested under three different speeds. Note that the speeds also fluctuated among the different fault types; the speed values are shown in Table 2.

The gearbox dataset was evenly divided into two non-overlapping parts, which were used as the training set and the testing set. The proposed method achieved excellent performance in its application to gearbox fault diagnosis. The diagnosis results for different values of $N_{in}$, compared with the original method, are displayed in Figure 17; notably, the parameter setting is the same as in the bearing case. As seen in the figure, the testing accuracies improved greatly. In addition, the non-monotonicity phenomenon was significantly weakened. The diagnosis accuracies with different percentages of training samples are shown in Figure 18. The experimental results show that the proposed method can effectively identify the gear health conditions with different fault types and severities, exhibiting better performance than the original method.

To demonstrate the diagnosis information in detail, the confusion matrix of the proposed method is displayed in Figure 19. Notably, the proposed method misclassifies 0.01% of the testing samples of Type-2 as Type-1 and 0.02% of the testing samples of Type-5 as Type-4. It is possible that the confused types were similar, making these two pairs more difficult to separate than the other types.
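Per-class misclassification percentages of the kind quoted above can be read from a row-normalized confusion matrix. The following minimal sketch shows one way to compute it; the function name and the integer label encoding (for example, Type-1 to Type-5 mapped to 0 to 4) are assumptions made for illustration.

```python
import numpy as np

def confusion_matrix_percent(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry (i, j) is the percentage of
    samples of true class i that were predicted as class j."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    counts = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(counts, (y_true, y_pred), 1)             # accumulate (true, predicted) pairs
    row_sums = counts.sum(axis=1, keepdims=True)
    return 100.0 * counts / np.maximum(row_sums, 1)    # avoid division by zero for empty classes
```

Each row sums to 100% for classes that occur in the test set, so off-diagonal entries directly give the share of a true class assigned to another class.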

Conclusions
This paper proposed a modified sparse filtering method for intelligent fault diagnosis of machinery by studying the nature of the input dimension and of overfitting. As illustrated in the experiments, the proposed method can effectively extract useful features from different fault types and achieve a higher diagnosis accuracy than the original method. The following major conclusions can be drawn.

1. The interpretation of the input dimension is studied based on the harmonic signal groups and bearing vibration signals. It can be concluded that the frequency resolution of the weight matrix depends on the input dimension.
2. The phenomenon referred to as non-monotonicity in this paper is explained as overfitting, which results from row vectors of the weight matrix that are not orthogonal.
3. The modified sparse filtering with a constraint term in the cost function can effectively handle the overfitting problem and eliminate the multi-correlation of the weight matrix.


Figure 2. Diagnosis results of different input dimensions of the two harmonic signal groups.

Figure 3. Relationship between the amplitudes of samples and the learned features.

Figure 4. Diagnosis results using various input dimensions for the harmonic signal group of different frequencies.

Figure 5. Relationship between the frequencies of samples and the learned features: (a) $N_{in} = 200$; (b) $N_{in} = 100$.

Figure 6. Classification results of sparse filtering of various input dimensions.

Figure 7. Selected weight vectors for the motor bearing dataset and the fitting vector by Gabor function: (a) Vectors in the time domain; (b) Fourier transforms.

Figure 9. Row vectors of $W$: (a) Vectors of $N_{in} = 100$ in the time domain; (b) Vectors of $N_{in} = 100$ in the frequency domain; (c) Vectors of $N_{in} = 200$ in the time domain; (d) Vectors of $N_{in} = 200$ in the frequency domain.

Figure 10. Illustration of the proposed two-stage learning method.

Figure 11. Diagnosis results using various input dimensions of modified and original sparse filtering.

Figure 12. Diagnosis results obtained by different percentages of samples using modified and original sparse filtering.

Figure 13. Visualization of features of validation dataset processed by t-SNE.

Figure 14. Results of $WW^T$ of $N_{in} = 200$ using the proposed method.

Figure 15. Row vectors of $W$: (a) Vectors of $N_{in} = 200$ in the time domain; (b) Vectors of $N_{in} = 200$ in the frequency domain.

Appl. Sci. 2018, 8, x

Figure 17. Diagnosis result using various input dimensions of modified and original sparse filtering.

Figure 18. Diagnosis result trained by different percentages of samples using modified and original sparse filtering.

Figure 19. Confusion matrix of the gear dataset.

Table 1. Gear parameters of the test gearbox.

Table 2. Speeds of different fault types.