The feasibility and effectiveness of the SSST-ViT diesel engine condition identification method were validated using the publicly available Case Western Reserve University (CWRU) bearing dataset and data measured on a diesel engine test bench. The experiments were run on Windows 11 with a 12th Gen Intel(R) Core(TM) i7-12700H 2.30 GHz processor, a GeForce RTX 3060 Laptop GPU and 16 GB of RAM, using Anaconda3, Python 3.9.13 and MATLAB R2021b; the deep learning framework was PyTorch 1.11.0.
3.1. Verifying the Feasibility of the SSST-ViT Method on the CWRU Dataset
Both rolling bearing and diesel engine vibration signals are time-varying, nonlinear and non-stationary [28]. The feasibility of the SSST-ViT method was therefore first verified on the publicly available CWRU bearing vibration dataset. According to the literature, this dataset was collected on a bearing fault simulation test bench.
The experiments used a deep groove ball bearing (model SKF6205) in which single-point faults were machined by electrical discharge machining (EDM), and the bearing vibration acceleration signals were collected with accelerometers. The data used here are the drive-end bearing data, sampled at 48 kHz with an approximate motor speed of 1797 r/min and a load of 0 hp. The bearing statuses comprise normal, inner ring fault, outer ring fault and rolling element fault, and each fault type is further divided into three fault diameters: 0.1778 mm, 0.3556 mm and 0.5334 mm. The 10 bearing statuses selected for this experiment (normal plus three fault types at three fault diameters each) are listed in Table 1.
A time-domain analysis of the vibration signals in each bearing status was performed. A segment of 5120 sampling points was taken from each status, giving the time-domain waveforms of the 10 bearing statuses shown in Figure 3.
As the time-domain waveforms show, the vibration signals of all statuses fluctuate widely, which makes effective fault status identification difficult [29]. The waveforms of the different statuses are complex and differ little from one another, so the individual statuses cannot be identified manually. Time-domain waveform analysis alone is therefore insufficient for rolling bearing fault status identification, and a more effective intelligent identification method is needed [30].
The proposed SSST-ViT method was applied to identify each bearing status. From the data of each status, 300 samples of 1024 sampling points each were taken at random, giving 3000 samples in total. Splitting them into training, validation and test sets at a ratio of 7:2:1 gave 2100 training, 600 validation and 300 test samples for the feasibility experiments of the SSST-ViT status recognition method.
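The sampling and splitting described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: segment start points are drawn uniformly at random (so segments may overlap, which the text neither confirms nor rules out), the 7:2:1 split is applied within each status (stratified), and `records` stands for one hypothetical 1-D drive-end signal per status.

```python
import numpy as np

def random_segments(signal, n_samples, seg_len, rng):
    """Cut n_samples windows of seg_len points at uniformly random start positions."""
    starts = rng.integers(0, len(signal) - seg_len + 1, size=n_samples)
    return np.stack([signal[s:s + seg_len] for s in starts])

def build_dataset(records, n_per_class=300, seg_len=1024, ratios=(7, 2, 1), seed=0):
    """Stratified 7:2:1 split of random segments; records[k] is the signal of status k."""
    rng = np.random.default_rng(seed)
    total = sum(ratios)
    parts = {"train": [], "val": [], "test": []}
    for label, signal in enumerate(records):
        X = random_segments(np.asarray(signal, dtype=float), n_per_class, seg_len, rng)
        n_train = n_per_class * ratios[0] // total
        n_val = n_per_class * ratios[1] // total
        chunks = {"train": X[:n_train], "val": X[n_train:n_train + n_val],
                  "test": X[n_train + n_val:]}
        for name, chunk in chunks.items():
            parts[name].append((chunk, np.full(len(chunk), label)))
    # Concatenate the per-status pieces into (samples, labels) arrays per split
    return {name: (np.concatenate([c for c, _ in lst]), np.concatenate([y for _, y in lst]))
            for name, lst in parts.items()}
```

With 10 statuses this yields 210/60/30 samples per status, i.e. the 2100/600/300 totals given in the text.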
SSST was applied to the original vibration signals to obtain time-frequency diagrams. To avoid affecting the classification results, the coordinate axes, legend and blank margins were not displayed. The processed time-frequency diagram of the first sample in each status is shown in Figure 4.
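The SSST step can be illustrated with a short NumPy sketch. It is not the authors' implementation, which is not given, but one common formulation under stated assumptions. The discrete S-transform is computed in its frequency-domain form, which corresponds to

$$S(\tau,f)=\int_{-\infty}^{\infty}x(t)\,\frac{|f|}{\sqrt{2\pi}}\,e^{-\frac{(\tau-t)^{2}f^{2}}{2}}\,e^{-i2\pi ft}\,\mathrm{d}t,$$

and synchrosqueezing estimates the instantaneous frequency of each cell as $\hat f(\tau,f)=f+\frac{1}{2\pi}\,\partial_\tau \arg S(\tau,f)$ (approximated here by the phase difference between neighboring samples), then moves the cell's energy to the frequency bin nearest $\hat f$. The threshold `gamma`, the magnitude-only reassignment (adequate for an image, not for signal reconstruction) and the function names `s_transform` and `synchrosqueeze` are illustrative assumptions.

```python
import numpy as np

def s_transform(x, fs):
    """Discrete S-transform (Stockwell) via its frequency-domain form.
    Returns frequencies k*fs/N (k = 0..N//2) and S with shape (N//2 + 1, N)."""
    x = np.asarray(x, dtype=float)
    N = x.size
    X = np.fft.fft(x)
    m = np.fft.fftfreq(N) * N            # integer frequency offsets, both signs
    S = np.empty((N // 2 + 1, N), dtype=complex)
    S[0] = x.mean()                      # zero-frequency row: signal mean by convention
    for k in range(1, N // 2 + 1):
        gauss = np.exp(-2.0 * np.pi**2 * m**2 / k**2)   # Gaussian window in frequency
        S[k] = np.fft.ifft(np.roll(X, -k) * gauss)       # spectrum shifted by k bins
    return np.arange(N // 2 + 1) * fs / N, S

def synchrosqueeze(freqs, S, fs, gamma=1e-3):
    """Reassign |S| to the instantaneous-frequency estimate of each cell."""
    df = freqs[1] - freqs[0]
    # Phase advance between neighboring samples (rad/sample) -> frequency offset in Hz
    dphi = np.angle(S[:, 1:] * np.conj(S[:, :-1]))
    dphi = np.concatenate([dphi, dphi[:, -1:]], axis=1)
    f_inst = freqs[:, None] + dphi * fs / (2.0 * np.pi)
    mag = np.abs(S)
    mask = mag > gamma * mag.max()       # skip near-zero cells whose phase is unreliable
    target = np.rint(f_inst[mask] / df).astype(int)
    cols = np.nonzero(mask)[1]
    ok = (target >= 0) & (target < freqs.size)
    T = np.zeros(S.shape)
    np.add.at(T, (target[ok], cols[ok]), mag[mask][ok])
    return T
```

For a 1024-point CWRU sample at 48 kHz, `s_transform(x, 48_000)` returns a 513 × 1024 array; plotting `synchrosqueeze(...)` of it as an image without axes gives a diagram analogous to those in Figure 4. On a pure tone, the squeezed energy collapses into a single frequency row, whereas the plain S-transform spreads it over neighboring rows.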
The colors in Figure 4 represent energy: the warmer the color, the greater the energy of the signal at that frequency. The horizontal and vertical axes indicate time and frequency, respectively, so each diagram shows how the frequency components of the signal change over time. The energy in the time-frequency diagram of each status is concentrated and the time-frequency resolution is good; because each status contains different features, the diagrams differ, with the warm-colored regions forming irregular blocks. Although the diagrams differ to some extent, they are highly similar, and it is difficult to distinguish the fault statuses accurately by eye. Each fault status was therefore identified by the ViT network, which has strong image classification capability.
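How a ViT turns a time-frequency image into a token sequence can be sketched in NumPy. The patch size of 16 and embedding width of 768 are those of the standard ViT-B/16 configuration and are assumptions here, since the paper does not state its ViT variant; the random projection, class token and position embeddings stand in for parameters that a real ViT learns during training.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches -> (N, patch*patch*C)."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0, "image size must be a multiple of the patch size"
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)       # (row of patches, column of patches, patch, patch, C)
    return x.reshape(-1, patch * patch * C)

def embed_tokens(img, patch=16, dim=768, seed=0):
    """Linear patch embedding + class token + position embedding (random placeholders)."""
    rng = np.random.default_rng(seed)
    patches = patchify(img, patch)
    W_proj = rng.normal(0.0, 0.02, (patches.shape[1], dim))   # learned in a real ViT
    cls = rng.normal(0.0, 0.02, (1, dim))                     # learned class token
    pos = rng.normal(0.0, 0.02, (patches.shape[0] + 1, dim))  # learned position embedding
    return np.vstack([cls, patches @ W_proj]) + pos
```

A 224 × 224 × 3 image becomes 196 patch tokens plus one class token (197 × 768); the transformer encoder processes this sequence, and its class-token output feeds the classification head.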
The images were first rendered without the legend, coordinate axes and blank margins. Each time-frequency map was then normalized to speed up model convergence. Finally, to improve training speed without affecting the recognition rate, the time-frequency maps were compressed and uniformly resized to 224 × 224.
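The preprocessing can be sketched as below. The text does not specify the normalization or resizing method, so min-max scaling to [0, 1] and bilinear interpolation are assumptions, as is the 224 × 224 target (the ViT input size given in Section 3.2); `img` stands for a rendered time-frequency image as an array.

```python
import numpy as np

def minmax_normalize(img):
    """Scale pixel values linearly to [0, 1]."""
    img = np.asarray(img, dtype=float)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)

def resize_bilinear(img, out_h, out_w):
    """Bilinear resize of an (H, W, C) array; corner pixels map onto corner pixels."""
    in_h, in_w = img.shape[:2]
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]        # vertical interpolation weights
    wx = (xs - x0)[None, :, None]        # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def preprocess(img, size=224):
    """Normalize, add a channel axis for grayscale input, and resize to size x size."""
    img = minmax_normalize(img)
    if img.ndim == 2:
        img = img[:, :, None]
    return resize_bilinear(img, size, size)
```

Because bilinear interpolation forms convex combinations of neighboring pixels, the resized map stays within the normalized [0, 1] range.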
Taking into account the network structure, the computer hardware and the characteristics and size of the samples, the ViT training parameters were configured as follows: batch size, 16; learning rate, 1 × 10⁻³; weight decay, 1 × 10⁻⁵; number of iterations, 100; input image size, 224 × 224; number of classification categories, 10; optimizer, stochastic gradient descent; and loss function, cross-entropy. The experimental results were extracted from the training log and plotted.
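The optimizer settings can be illustrated on the simplest possible case: a linear classification head trained on fixed features with mini-batch SGD and cross-entropy loss. This is not the ViT training code; the linear head (`train_head`), the synthetic features and treating the 100 iterations as epochs are assumptions, and momentum is omitted because the text does not mention it. Weight decay is folded into the gradient as weight_decay × W, which is how PyTorch's `torch.optim.SGD` applies its `weight_decay` argument.

```python
import numpy as np

# Hyperparameters listed in the text (iterations interpreted as epochs; no momentum)
CONFIG = {"batch_size": 16, "lr": 1e-3, "weight_decay": 1e-5, "epochs": 100, "num_classes": 10}

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy loss and its gradient with respect to the logits."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    rows = np.arange(len(labels))
    loss = -log_probs[rows, labels].mean()
    grad = np.exp(log_probs)
    grad[rows, labels] -= 1.0
    return loss, grad / len(labels)

def sgd_step(W, grad_W, lr, weight_decay):
    """Plain SGD update with L2 weight decay added to the gradient."""
    return W - lr * (grad_W + weight_decay * W)

def train_head(features, labels, cfg=CONFIG, seed=0):
    """Train a hypothetical linear head with shuffled mini-batches."""
    rng = np.random.default_rng(seed)
    W = np.zeros((features.shape[1], cfg["num_classes"]))
    for _ in range(cfg["epochs"]):
        order = rng.permutation(len(labels))
        for start in range(0, len(labels), cfg["batch_size"]):
            idx = order[start:start + cfg["batch_size"]]
            _, g = softmax_cross_entropy(features[idx] @ W, labels[idx])
            W = sgd_step(W, features[idx].T @ g, cfg["lr"], cfg["weight_decay"])
    return W
```

With all weights at zero the loss equals ln 10 ≈ 2.303 for 10 classes, a useful sanity check for the first logged value of a training curve.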
ViT was the first model to apply a transformer, in place of a CNN, to image classification, and it achieves strong results in that field. This paper therefore applies a ViT model to diesel engine fault status identification. Because the network is better suited to extracting feature information from high-dimensional data, the one-dimensional vibration signals of the diesel engine must first be converted into two-dimensional images. The S-transform combines the advantages of the continuous wavelet transform and the short-time Fourier transform and offers better noise robustness and time-frequency analysis accuracy. However, in the time-frequency map it produces, the energy at a given moment is spread over a wide band around the instantaneous frequency; this energy leakage causes frequency band mixing and lowers the time-frequency resolution. Combining the synchrosqueezing transform with the S-transform yields the synchrosqueezing S-transform (SSST), which improves time-frequency aggregation, time-frequency resolution and noise robustness compared with traditional time-frequency analysis methods. Together, these give the diesel engine fault status identification method proposed in this paper, SSST-ViT. ST-ViT is used as a comparison method to verify that SSST characterizes the feature information in the original signal better and retains more of its useful information. SSST-2DCNN is used as a comparison method to verify that the ViT model has stronger image classification capability than the traditional CNN model and is better suited to identifying the fault statuses of the diesel engine.
The FFT spectrum-1DCNN model is used as a comparison method to verify that a 2D representation captures the correlation of the signal over time better than the 1D signal does, i.e., that converting the original signal into a 2D image is more effective than feeding the 1D signal directly into the network model. In summary, the ST-ViT, SSST-2DCNN and FFT spectrum-1DCNN models are appropriate comparison models for this study.
In Reference [31], the authors proposed an intelligent ViT-based fault diagnosis model for rolling bearings and achieved good results. Building on this, the present paper applies ViT to diesel engine fault status identification for the first time, in the form of the SSST-ViT method, and verifies its feasibility and effectiveness. The ST-ViT comparison model is adapted from Reference [31] and is used to verify the superiority of the SSST time-frequency analysis method. In Reference [32], the authors proposed a 2DCNN-based fault diagnosis method for diesel engines in which short-time Fourier transform (STFT) time-frequency maps are fed into a 2DCNN for training. However, STFT time-frequency maps suffer from low time-frequency resolution and weak energy aggregation. The SSST-2DCNN comparison model therefore improves on Reference [32] by combining the SSST method, which has high time-frequency resolution and good time-frequency aggregation, with the 2DCNN. In Reference [33], the authors proposed a 1DCNN-based fault diagnosis method for diesel engines in which features extracted from the vibration signal are fed into a 1DCNN for training. Because that method feeds multiple vibration signal features into the 1DCNN, it tends to introduce redundancy, which reduces model efficiency. The comparison method in this paper therefore uses the FFT spectrum as the fault feature: different diesel engine fault statuses generate fault features at different frequencies, and the FFT spectrum reflects these features well. The FFT spectrum-1DCNN model is thus derived from Reference [33].
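The FFT spectrum input of the FFT spectrum-1DCNN comparison model can be sketched as a one-sided amplitude spectrum. The paper does not specify the scaling or windowing, so the normalization (|X|/N, doubled for bins other than DC and Nyquist), the absence of a window and the function name `fft_amplitude_spectrum` are assumptions.

```python
import numpy as np

def fft_amplitude_spectrum(segment, fs):
    """One-sided amplitude spectrum of a real segment (a possible 1DCNN input vector)."""
    x = np.asarray(segment, dtype=float)
    N = x.size
    amp = np.abs(np.fft.rfft(x)) / N
    # Fold the negative-frequency half into the positive bins (DC and Nyquist excluded)
    amp[1:(N + 1) // 2] *= 2.0
    return np.fft.rfftfreq(N, d=1.0 / fs), amp
```

For a 1024-point CWRU sample at 48 kHz this gives 513 bins spaced 46.875 Hz apart, and a sinusoid of amplitude A that falls exactly on a bin appears with height A.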
The training results of the proposed model were compared with those of the ST-ViT, SSST-2DCNN and FFT spectrum-1DCNN models. The loss values and accuracies of each model on the training and validation sets are shown in Figure 5, and the fault status identification results after 100 iterations are given in Table 2 (Model 1: SSST-ViT; Model 2: ST-ViT; Model 3: SSST-2DCNN; Model 4: FFT spectrum-1DCNN).
As Figure 5 and Table 2 show, all of the fault status identification models converged within 100 iterations and performed well on the CWRU public dataset. In terms of accuracy and loss, the proposed SSST-ViT method converged fastest and achieved the highest accuracy and lowest loss on both the training and validation sets, outperforming the other three methods. It was also the most stable during training: its accuracy and loss curves are smooth overall, whereas those of the three comparison methods fluctuate to varying degrees. Compared with the comparison models, SSST-ViT therefore performs better in accuracy, loss and stability, which verifies the feasibility of the proposed fault status identification method.
The test-set accuracies and confusion matrices of the different fault status identification models are shown in Table 3 and Figure 6, respectively.
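Test-set accuracy and the confusion matrix both follow directly from the true and predicted status labels. A minimal sketch, assuming the common convention of rows for the true status and columns for the predicted status; the function names and toy labels are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix: rows are true statuses, columns are predicted statuses."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def accuracy(cm):
    """Fraction of samples on the diagonal (correctly identified)."""
    return np.trace(cm) / cm.sum()

def per_class_recall(cm):
    """Share of each true status that was identified correctly."""
    return np.diag(cm) / cm.sum(axis=1)
```

Off-diagonal entries show which statuses are confused with which, which is what makes the confusion matrix more informative than the single accuracy figure.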
Figure 6 and Table 3 show that the proposed method achieves the best fault identification performance of the compared methods and can effectively distinguish bearing fault types that are easily confused. To examine the feature extraction capability of the SSST-ViT method, the output of the classification layer of the ViT model was extracted as discriminative features, and the bearing fault status identification results were visualized in three dimensions with t-distributed stochastic neighbor embedding (t-SNE), a nonlinear dimensionality reduction technique suited to visualizing high-dimensional data. The embeddings of the raw training-set data, the raw test-set data, the training-set features and the test-set features are shown in Figure 7.
Taking the test-set features in Figure 7 as an example: since none of the methods reached 100% accuracy on the test set, some points necessarily fall outside their clusters. That is, some features are identified as belonging to other fault statuses, so their points lie outside the corresponding cluster and the accuracy is below 100%.
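The t-SNE visualizations (Figures 7 and 15) can be sketched as follows. The paper does not say which t-SNE implementation was used; a library such as scikit-learn's `sklearn.manifold.TSNE` is the usual choice. The NumPy code below is a minimal exact t-SNE written only to show the technique: perplexity-calibrated Gaussian affinities in the input space, Student-t affinities in a 3-D embedding, and momentum gradient descent on the Kullback-Leibler divergence with early exaggeration. The perplexity, iteration counts and the step size n / (4 × exaggeration) (the rule behind scikit-learn's `learning_rate='auto'`, without its lower bound of 50) are assumptions. In the paper's setting, each row of `X` would be one sample's ViT classification-layer output.

```python
import numpy as np

def _affinities(d2, perplexity, tol=1e-5, max_iter=100):
    """Row-wise Gaussian affinities, each row calibrated to the target perplexity."""
    n = d2.shape[0]
    P = np.zeros((n, n))
    target = np.log(perplexity)                   # target entropy in nats
    for i in range(n):
        d = np.delete(d2[i], i)
        d = d - d.min()                           # shift for stability; p is unchanged
        beta, lo, hi = 1.0, 0.0, np.inf
        for _ in range(max_iter):
            p = np.exp(-beta * d)
            s = p.sum()
            H = np.log(s) + beta * np.dot(d, p) / s
            if abs(H - target) < tol:
                break
            if H > target:                        # too flat: increase precision
                lo = beta
                beta = beta * 2.0 if np.isinf(hi) else (beta + hi) / 2.0
            else:                                 # too peaked: decrease precision
                hi = beta
                beta = (beta + lo) / 2.0
        P[i, np.arange(n) != i] = p / s
    return P

def tsne(X, dim=3, perplexity=10.0, n_iter=500, exaggeration=4.0, seed=0):
    """Exact t-SNE with momentum gradient descent (no gains, no Barnes-Hut)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    P = _affinities(d2, perplexity)
    P = (P + P.T) / (2.0 * n)                     # symmetric joint probabilities
    lr = n / (4.0 * exaggeration)                 # step-size heuristic (see lead-in)
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((n, dim))
    vel = np.zeros_like(Y)
    for it in range(n_iter):
        early = it < 100
        sy = np.sum(Y**2, axis=1)
        num = 1.0 / (1.0 + np.maximum(sy[:, None] + sy[None, :] - 2.0 * Y @ Y.T, 0.0))
        np.fill_diagonal(num, 0.0)
        Q = num / num.sum()                       # Student-t joint probabilities
        W = ((exaggeration if early else 1.0) * P - Q) * num
        grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
        vel = (0.5 if early else 0.8) * vel - lr * grad
        Y = Y + vel
    return Y - Y.mean(axis=0)
```

Well-separated groups in the input space end up as separate clusters in the embedding; points whose features resemble another status land in that status's cluster, which is how misidentified samples appear in such plots.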
3.2. Verifying the Validity of the SSST-ViT Method with Measured Data
To verify the effectiveness of the SSST-ViT diesel engine fault status identification method, this study used the laboratory's high-pressure common rail diesel engine test bench, with a CA6DF3-20E3 diesel engine as the research object, to collect condition monitoring data while the engine operated in different fault modes; these data support research on diesel engine fault status identification methods. The test bench consists of two parts: the diesel engine system and the data acquisition system. A panoramic view of the test bench is shown in Figure 8, and the sensor installation is shown in Figure 9.
Based on an analysis of the structure and function of the diesel engine and of its typical failure modes during use and maintenance, preset-fault experiments were carried out on the diesel engine condition monitoring test bench: faults were introduced by artificially machining or replacing components, and data were then collected with the engine in each fault status. The typical failure modes set for the diesel engine are listed in Table 4.
In actual maintenance and repair, the working environment is complex and harsh, so diesel engine failures often involve several concurrent faults rather than a single failure mode. The preset failure modes therefore include both single failure modes and three mixed failure modes. As shown in Figure 10, a cylinder misfire fault is simulated by disconnecting the cylinder ignition power line, and an air filter blockage fault is simulated by adding an outer cover to the air intake.
The cylinder head vibration signal of the diesel engine was acquired at a sampling frequency of 20 kHz, with a single recording time of 12 s and an interval of 30 s between recordings. The acquisition experiments yielded 300 sets of data for each failure mode, each set containing six channels with 240,000 data points per recording. To avoid errors introduced while the engine moves from start-up to steady state, the first 20 sets of data for each status were discarded, and the last 20 sets from the 5th channel of each status were selected as sample data.
Because time-domain signals are simple, intuitive and have a clear physical meaning, a time-domain analysis of the vibration signal in each diesel engine status was performed first. A segment of 5000 sampling points was taken for each failure mode, giving the time-domain waveforms shown in Figure 11.
The cylinder head vibration signal is nonlinear and non-stationary, and it contains complex noise from the environment and from the combined action of the engine components during operation, which makes fault status identification difficult. The time-domain waveforms in Figure 11 show that the vibration signals under the different fault modes are complex, with essentially the same amplitude range and no obvious difference in amplitude, so the statuses cannot be identified manually. Time-domain waveform analysis alone therefore cannot identify multiple engine faults effectively, and more effective fault information extraction and intelligent identification methods are needed.
The SSST-ViT method was then applied to identify each of the diesel engine statuses above. Taking 300 samples of 5000 sampling points from each status gave 2100 samples in total. Splitting them into training, validation and test sets at a ratio of 7:2:1 gave 1470 training, 420 validation and 210 test samples, i.e., 210 training, 60 validation and 30 test samples per status. The raw vibration signals were processed with SSST and represented as two-dimensional color time-frequency diagrams, with the coordinate axes, legend and blank margins hidden to avoid affecting the classification results. The processed time-frequency diagram of the first sample in each diesel engine status is shown in Figure 12.
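The sample counts in this section can be checked with simple arithmetic. The sketch assumes, since the text implies but does not state it, that the 300 samples of each status are cut from the 20 selected channel-5 recordings; the constant names are illustrative.

```python
# Acquisition and sampling figures stated in the text
FS_HZ = 20_000            # sampling frequency
RECORD_S = 12             # length of one recording in seconds
SETS_USED = 20            # recordings kept per status (channel 5)
SEG_LEN = 5_000           # points per sample
SAMPLES_PER_STATUS = 300
N_STATUSES = 7
RATIOS = (7, 2, 1)        # train : validation : test

points_per_set = FS_HZ * RECORD_S                     # points in one recording
points_per_status = points_per_set * SETS_USED        # assumed sampling pool per status
max_disjoint_samples = points_per_status // SEG_LEN   # non-overlapping 5000-point windows
segment_seconds = SEG_LEN / FS_HZ                     # duration of one sample
per_status = [SAMPLES_PER_STATUS * r // sum(RATIOS) for r in RATIOS]
totals = [n * N_STATUSES for n in per_status]

print(points_per_set, max_disjoint_samples, segment_seconds, per_status, totals)
# 240000 960 0.25 [210, 60, 30] [1470, 420, 210]
```

Under this assumption, the 300 samples per status need less than a third of the available non-overlapping windows, and each sample spans 0.25 s of engine operation.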
Although the statuses in Figure 12 differ somewhat in appearance, they are highly similar, and it is difficult to distinguish the fault statuses by eye. Each fault status was therefore identified by the ViT network, which has strong image classification capability.
The images were first rendered without the legend, coordinate axes and blank margins. Each time-frequency map was then normalized to speed up model convergence. Finally, the time-frequency maps were compressed without affecting the recognition rate and uniformly resized to 224 × 224.
Taking into account the network structure, the computer hardware and the characteristics and size of the samples, the ViT training parameters were configured as follows: batch size, 16; learning rate, 1 × 10⁻³; weight decay, 1 × 10⁻⁵; dropout rate, 0.1; number of iterations, 100; input image size, 224 × 224; number of classification categories, 7; optimizer, stochastic gradient descent; and loss function, cross-entropy. The experimental results were extracted from the training log and plotted.
The training results of the proposed model were compared with those of the ST-ViT, SSST-2DCNN and FFT spectrum-1DCNN models. The loss values and accuracies of each model on the training and validation sets are shown in Figure 13, and the fault status identification results after 100 iterations are given in Table 5 (Model 1: SSST-ViT; Model 2: ST-ViT; Model 3: SSST-2DCNN; Model 4: FFT spectrum-1DCNN).
Figure 13 and Table 5 show that, in terms of accuracy and loss, the proposed SSST-ViT method achieved the highest accuracy and lowest loss on both the training and validation sets and converged fastest during iteration compared with the other three methods. Its accuracy and loss curves were also generally more stable during training. Compared with the comparison methods, SSST-ViT therefore performs better in fault identification accuracy, loss and stability.
The models were evaluated on the test set, and the resulting accuracies and confusion matrices of the different fault status identification models are shown in Table 6 and Figure 14, respectively.
Table 6 and Figure 14 show that the proposed method achieves the best diesel engine fault identification performance of the compared methods and can effectively distinguish easily confused fault types.
To examine the feature extraction capability of the SSST-ViT method, the output of the classification layer of the ViT model was extracted as discriminative features, and the fault status recognition results were visualized in three dimensions with t-SNE, a nonlinear dimensionality reduction technique suited to visualizing high-dimensional data. The embeddings of the raw training-set data, the raw test-set data, the training-set features and the test-set features are shown in Figure 15.
Taking the test-set features in Figure 15 as an example: since none of the methods reached 100% accuracy on the test set, some points necessarily fall outside their clusters, because some features are identified as features of other fault statuses. Figure 15 also shows that the SSST-ViT method has excellent feature extraction performance: the features of each fault status are clearly separable, with the different fault status types occupying different regions of the space and forming dense clusters.
In summary, the effectiveness and superiority of the proposed diesel engine fault status recognition method are verified: SSST-ViT effectively extracts fault features and achieves higher recognition accuracy than the compared methods.