1. Introduction
Internal combustion engines (ICEs) serve as one of the core power sources for mechanical equipment. Owing to their high thermal efficiency and convenient mobility, they are widely applied in transportation, construction machinery, power generation units, and general aviation [1]. Increasingly strict environmental protection requirements have made ICE structures progressively more complex, gradually increasing the probability of malfunction [2]. As the primary power source for most mechanical equipment, any failure in an ICE can lead to economic losses or even accidents. Monitoring the abnormal operating conditions of ICEs is therefore of significant importance for enhancing safety and preventing failures [3].
Abnormal condition monitoring requires early identification of potential issues. Operational parameters of ICEs, such as rotational speed, water temperature, and oil pressure, or indirect signals such as vibration and noise, can all serve as signal sources for identifying abnormal states [4,5,6]. Among these, vibration signals are the most widely used because of their rich information content, ease of measurement, and high signal-to-noise ratio. Many researchers decompose vibration signals to extract fault-sensitive components and monitor fault occurrence against thresholds established under normal operating conditions. For instance, Chegini et al. employed Ensemble Empirical Mode Decomposition combined with Wavelet Packet Decomposition to extract frequency-domain-sensitive components from bearing vibrations, effectively detecting the onset of degradation [7]. Han et al. used Local Mean Decomposition to extract multiple components of bearing faults and then applied dynamic information entropy as a feature for fault recognition [8]. Xu et al. used Variational Mode Decomposition to extract bearing fault characteristics and applied a multi-point kurtosis deconvolution method to enhance the impact components [9]. While these methods can effectively extract fault-sensitive features and support fault mechanism research, they require strong expert knowledge and complex data processing workflows, typically permitting only offline analysis rather than online recognition.
With the development of deep learning, the accuracy of fault diagnosis has gradually improved, and online abnormal condition monitoring for ICEs is becoming feasible. Hasan et al. converted multi-sensor fault signals into images using the short-time Fourier transform (STFT) and then fused them as input to a convolutional neural network (CNN) for identification [10]. However, converting signals into images is an unnecessary and time-consuming step that hinders real-time diagnosis. Liu et al. designed a multi-task CNN model for speed and load identification, improving fault recognition accuracy [11]. CNNs capture only the spatial characteristics of a signal and cannot capture temporal features or establish long-term dependencies; as a result, recurrent neural networks (RNNs) and attention-based networks have emerged. Zhang et al. applied a bidirectional RNN model for fault detection in chemical processes, achieving good results [12]. Shiney et al. developed a gasket inspection system based on a multi-layer CNN model, which ensures the correct alignment of the radiator gasket through image recognition; diagnostic results on a real dataset demonstrate the effectiveness of the system [13]. Soheil et al. utilized deep learning-based image segmentation to achieve fault diagnosis of solenoid starters in DC motors [14]. Currently, most researchers perform deep learning-based fault diagnosis by converting vibration signals into images or by similar feature extraction approaches. However, one-dimensional signals contain not only time-domain information but also frequency-domain characteristics, and both types of information are crucial for fault analysis.
To address the challenges of complex feature extraction and high data dependency in current internal combustion engine fault diagnosis, this paper aims to develop a deep learning model that integrates time-domain and frequency-domain information for comprehensive feature representation. The proposed Time–Frequency Domain Diagnosis Network (TFDN) leverages residual structures and self-attention mechanisms to capture temporal dependencies, while utilizing CNNs for spectral features, enhancing diagnostic accuracy and robustness. Through experimental validation on simulated diesel engine faults, we demonstrate that TFDN achieves superior performance compared to state-of-the-art methods, maintaining high accuracy even under data-scarce conditions. The main conclusion is that TFDN provides an efficient, end-to-end solution for real-time fault detection, facilitating reliable monitoring of ICE health.
3. Time–Frequency Domain Diagnosis Network
Internal combustion engines are reciprocating machines, and their vibration signals originate primarily from periodic combustion explosion impacts and the knocking caused by the reciprocating motion of the piston. Vibrations are also induced by the operation of various components and become mutually coupled during transmission. As a result, the measured signals often contain a large amount of complex information. Time-domain vibration signals provide information about the temporal variation of the vibration, including its amplitude, waveform characteristics, and periodicity, among others. In the frequency spectrum, vibration signals manifest as energy at different frequencies and its distribution characteristics. Conventional CNN or CNN–Transformer models commonly employ one of three strategies: converting the temporal signal into an image, using only time-domain data, or using only frequency-domain data as the input for fault diagnosis. However, when a fault occurs, it may be characterized by changes in the time-domain waveform or by alterations in the vibration energy and its distribution in the frequency domain. While deeper networks and ample data can improve diagnostic accuracy, a more efficient and direct approach is to design dedicated networks for time-domain and frequency-domain feature extraction. Combining time-domain and frequency-domain signals provides more comprehensive diagnostic information, which helps to improve accuracy. For diagnostic models, simultaneously incorporating time-domain and frequency-domain information improves the feature representation capability by allowing the two feature sets to complement each other.
Based on the above analysis, this section establishes a deep learning model that integrates both the time-domain and frequency-domain representations of the vibration signal. Assume the input signal is $x \in \mathbb{R}^{B \times L}$, where $B$ denotes the batch size and $L$ represents the signal length. The overall forward propagation is given by Equations (10)–(12) at the end of this section, in which $\mathrm{FFT}(x)$ denotes the Fourier transform of $x$ and $[\cdot,\cdot]$ represents the concatenation operation.
In the time-domain feature extraction branch, a single-head self-attention mechanism (SAM) effectively extracts sequential features by modeling relationships within the sequence, capturing global information and long-range dependencies. However, because of the large number of parameters in the SAM, extracting features directly from the raw signal is challenging. Therefore, several convolutional and residual convolutional layers are placed before the SAM layer to extract preliminary signal features, forming the time-domain feature extraction network. Through a traversal (exhaustive) search, the number of residual blocks and convolutional layers was reduced to minimize the model parameters. Adding batch normalization (BN) and replacing the activation function with the Swish function gradually improved the performance of the time-domain feature extraction model. The network architecture is detailed in Table 1, where "64@3×1" indicates 64 output channels and a convolution kernel width of 3, P denotes the dropout rate, FC stands for a fully connected layer, and c represents the number of classes.
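As an illustration of this branch, the following is a minimal PyTorch sketch rather than the exact implementation: the class names (ResidualConvBlock, TimeDomainBranch), the channel count of 64, the stem layout, and the number of residual blocks are assumptions standing in for the configuration listed in Table 1, and Swish is realized with nn.SiLU.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """1D residual convolution block with BN and Swish (SiLU) activation."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.bn2 = nn.BatchNorm1d(channels)
        self.act = nn.SiLU()                            # Swish activation

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + x)                        # residual connection

class TimeDomainBranch(nn.Module):
    """Convolution + residual blocks followed by single-head self-attention (SAM)."""
    def __init__(self, channels=64, kernel_size=3, dropout=0.2):
        super().__init__()
        self.stem = nn.Sequential(                      # preliminary feature extraction
            nn.Conv1d(1, channels, kernel_size, stride=2, padding=kernel_size // 2),
            nn.BatchNorm1d(channels),
            nn.SiLU(),
            nn.MaxPool1d(2),
        )
        self.res_blocks = nn.Sequential(ResidualConvBlock(channels),
                                        ResidualConvBlock(channels))
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=1,
                                          dropout=dropout, batch_first=True)

    def forward(self, x):                               # x: (B, L) raw time-domain signal
        x = self.stem(x.unsqueeze(1))                   # -> (B, C, L')
        x = self.res_blocks(x)
        x = x.transpose(1, 2)                           # -> (B, L', C) for attention
        x, _ = self.attn(x, x, x)                       # single-head self-attention
        return x.mean(dim=1)                            # pooled time-domain feature (B, C)
```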
The frequency-domain signal is characterized by a line spectrum, which essentially contains only spatial distribution features. Since CNNs have strong capabilities for extracting spatial features, they are well suited to frequency-domain feature extraction. Testing showed that a multi-layer convolutional neural network is already sufficient to extract the frequency-domain features of the signal. Therefore, for the frequency-domain signal, a CNN comprising three convolutional layers with pooling and batch normalization layers is used for feature extraction, with the Swish function used for activation. The frequency-domain feature extraction network is shown in Table 2; the parameters of the FFT layer are not learnable.
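Likewise, a hedged sketch of the frequency-domain branch is given below; the channel progression (16, 32, 64) and pooling layout are illustrative placeholders for the values in Table 2, and the non-learnable FFT layer is realized here with torch.fft.rfft.

```python
import torch
import torch.nn as nn

class FreqDomainBranch(nn.Module):
    """FFT (non-learnable) followed by three Conv1d/BN/pool stages with Swish."""
    def __init__(self, channels=(16, 32, 64), kernel_size=3):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in channels:
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
                       nn.BatchNorm1d(out_ch),
                       nn.SiLU(),                 # Swish activation
                       nn.MaxPool1d(2)]
            in_ch = out_ch
        self.cnn = nn.Sequential(*layers)

    def forward(self, x):                         # x: (B, L) raw time-domain signal
        spec = torch.fft.rfft(x).abs()            # magnitude line spectrum; no learnable parameters
        feat = self.cnn(spec.unsqueeze(1))        # (B, C, F') spatial features of the spectrum
        return feat.mean(dim=-1)                  # pooled frequency-domain feature (B, C)
```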
By combining the time-domain and frequency-domain feature extraction networks, we obtain the feature set of the signal in both domains. After merging the final feature layers and performing classification, the fault diagnosis result is obtained. The last two layers of the networks in Table 1 and Table 2 are shared.
Figure 3 illustrates a schematic diagram of the Time–Frequency Domain Diagnosis Network (TFDN). Based on the above, the complete forward propagation can be represented by Equations (10)–(12):

$$F_t = f_{\mathrm{TD}}(x), \tag{10}$$
$$F_f = f_{\mathrm{FD}}(\mathrm{FFT}(x)), \tag{11}$$
$$y = \mathrm{Softmax}\big(W\,[F_t, F_f] + b\big), \tag{12}$$

where $F_t$ represents the extracted time-domain features, $F_f$ represents the extracted frequency-domain features, $f_{\mathrm{TD}}(\cdot)$ and $f_{\mathrm{FD}}(\cdot)$ denote the time-domain and frequency-domain feature extraction networks, $W$ denotes the weight matrix of the neural network, and $b$ denotes the bias term.
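To show how Equations (10)–(12) fit together, the sketch below assembles the two branch modules sketched earlier (the hypothetical TimeDomainBranch and FreqDomainBranch classes) behind a shared fully connected head; the hidden width of 128 is an assumption, and the Softmax is left to the loss function or the inference step.

```python
import torch
import torch.nn as nn

class TFDN(nn.Module):
    """Time–Frequency Domain Diagnosis Network: concatenate branch features, then classify."""
    def __init__(self, time_branch: nn.Module, freq_branch: nn.Module,
                 time_dim: int, freq_dim: int, num_classes: int, dropout: float = 0.2):
        super().__init__()
        self.time_branch = time_branch          # Eq. (10): F_t = f_TD(x)
        self.freq_branch = freq_branch          # Eq. (11): F_f = f_FD(FFT(x))
        self.head = nn.Sequential(              # shared FC layers feeding Eq. (12)
            nn.Linear(time_dim + freq_dim, 128),  # hidden width 128 is an assumption
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                       # x: (B, L)
        f_t = self.time_branch(x)               # time-domain features
        f_f = self.freq_branch(x)               # frequency-domain features (FFT inside the branch)
        logits = self.head(torch.cat([f_t, f_f], dim=1))   # concatenation "[,]"
        return logits                           # Softmax applied by the loss / at inference
```

Under these assumptions, the model would be built as `model = TFDN(TimeDomainBranch(), FreqDomainBranch(), time_dim=64, freq_dim=64, num_classes=12)`.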
4. Case Study
To validate the effectiveness of the ICE abnormal state identification method, we designed a simulated fault test on an inline six-cylinder diesel engine and collected vibration acceleration data from the cylinder head and engine block, as shown in Figure 4. The diesel engine test bench system consists of two parts: the test bench and the signal acquisition system. The test bench includes the diesel engine, an electric dynamometer, a driveshaft, and the water, oil, and air piping systems, as shown in Figure 5. The signal acquisition system comprises sensors, wiring harnesses, data acquisition front-ends, and a test computer. Vibration acceleration was measured with a PCB 621B40 accelerometer (Depew, NY, USA), and engine speed was measured with a MONARCH SPSE-115 photoelectric sensor (Medicine Hat, AB, Canada).
Nahim et al. reviewed the main fault types in diesel engines and statistically analyzed the probabilities of the various faults [19]. The results showed that faults in the fuel injection and fuel supply systems, waterway leakage faults, and valve seat faults have the highest probabilities of occurrence. Waterway leakage, which affects engine cooling, can be easily detected using temperature sensors. Faults in the fuel supply system and the valve seat system involve complex mechanical structures, making diagnosis more challenging. For these two fault types, we designed four categories of faults: abnormal injection pressure, abnormal fuel injection, abnormal injection advance angle, and abnormal valve clearance, each including different degrees of abnormality. The specific faults and their parameters are shown in Table 3, comprising a total of twelve distinct states.
During the test, the sampling frequency was set to 25,600 Hz, and the test speeds included three levels: 1600 r/min, 2000 r/min, and 2300 r/min. The engine operated under two load conditions: 50% load and 100% load. The segment length was therefore determined from the duration of a single combustion cycle at the minimum speed (0.075 s), so L should be greater than 1920; in this paper, L was set to 2048. A 20% overlap ratio was adopted when segmenting the data to increase the data volume. After segmentation, no windowing, filtering, or normalization operations were performed.
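A minimal NumPy sketch of this segmentation step is shown below; the function name segment_signal is illustrative, and only the window length (2048) and the 20% overlap come from the text.

```python
import numpy as np

def segment_signal(signal: np.ndarray, seg_len: int = 2048, overlap: float = 0.2) -> np.ndarray:
    """Cut a 1D vibration record into overlapping segments of length seg_len."""
    # 0.075 s per combustion cycle at 1600 r/min sampled at 25,600 Hz -> 1920 samples,
    # so seg_len = 2048 covers at least one full cycle.
    assert seg_len >= 1920, "segment must cover one combustion cycle at the minimum speed"
    hop = int(seg_len * (1.0 - overlap))            # 20% overlap -> hop of 1638 samples
    n_segments = 1 + (len(signal) - seg_len) // hop
    return np.stack([signal[i * hop: i * hop + seg_len] for i in range(n_segments)])

# Example: segment a 60 s record sampled at 25,600 Hz.
x = np.random.randn(60 * 25600)
print(segment_signal(x).shape)                      # (n_segments, 2048)
```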
The proposed method is compared with Random Forest (RF), Long Short-Term Memory (LSTM), CNN, and ResNet18. The parameter settings for each method are shown in Table 4; the time-domain extraction network (TDEN) is the time-domain feature extraction part of the proposed method. The parameter counts of the models are also listed in Table 4, where the parameters of RF are estimated based on an average of 1500 nodes per decision tree. The number of parameters of the proposed model is smaller than that of ResNet18 and LSTM, but larger than that of CNN and RF.
Using the methods listed in Table 4 to diagnose the engine fault data, we selected data from three sensors: the first cylinder head (1H), the first cylinder block (1B), and the third cylinder head (3H). Each dataset includes three engine speeds: 1600 r/min, 2000 r/min, and 2300 r/min. Among the fault types, abnormal injection pressure, abnormal fuel injection, and abnormal injection advance angle occur simultaneously in all cylinders, while abnormal valve clearance occurs only in the first cylinder. Each fault type has 200 samples, with each sample containing data for one engine cycle. The ratio of the training, test, and validation sets is 3:1:1, and each model was trained for 200 epochs.
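For reference, a small scikit-learn sketch of the 3:1:1 split is given below; the placeholder arrays X and y stand in for the segmented samples and their labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 12 classes x 200 samples of length 2048 (matching the dataset description).
X = np.random.randn(12 * 200, 2048)
y = np.repeat(np.arange(12), 200)

# 3:1:1 split: hold out 40% first, then split the holdout equally into test and validation.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0)
print(len(X_train), len(X_test), len(X_val))        # 1440, 480, 480
```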
The diagnostic results under full-load, single-speed conditions are shown in Table 5. The results indicate that RF lacks sufficient feature extraction capability, achieving only 20% accuracy, because it relies heavily on preprocessing and cannot perform end-to-end diagnosis. LSTM also shows low diagnostic accuracy; the method performs well for regression tasks but is less effective for this classification task. CNN has limited temporal feature extraction capability, and because of the complexity of engine time-domain signals, its accuracy fluctuates between 50% and 85%. ResNet, with its deeper network, demonstrates excellent feature extraction performance, achieving an accuracy of 88.75–99.58% and significantly outperforming CNN. TDEN has fewer network layers than ResNet, yet its accuracy is very close to that of ResNet, suggesting that the introduction of SAM effectively enhances temporal feature extraction. TFDN achieves an accuracy of 98.12–99.79%, indicating that the introduction of frequency-domain features significantly improves diagnostic stability.
To verify the diagnostic performance of the algorithms on mixed-speed and mixed-load data, the datasets from sensors 1H, 1B, and 3H were combined and shuffled. These datasets cover speeds of 1600 r/min, 2000 r/min, and 2300 r/min and loads of 50% and 100%, with each condition containing 200 samples. The ratio of the training, test, and validation sets is 3:1:1. To ensure a uniform dimension, each sample was truncated to 2048 data points. The diagnostic results obtained with the aforementioned algorithms, including accuracy, precision, and recall, are shown in Table 6. As the amount of training data increased, the accuracy of LSTM improved, whereas the accuracy of ResNet, TDEN, and TFDN decreased slightly. Nevertheless, TFDN still demonstrated the best performance.
Observing the changes in loss and accuracy during training allows a better assessment of model performance. Based on the results in Table 6, the CNN, ResNet, and TFDN methods were selected for comparison, with the 3H dataset as an example. Figure 6a–c show the loss and accuracy curves during training for the three algorithms. The results indicate that the CNN has only average fitting capability and insufficient time-domain feature extraction ability. Although ResNet eventually achieves a better fit, it fluctuates strongly during the first 30 epochs because of its deep architecture, which requires more data and training time. The proposed TFDN model demonstrates fast convergence, the best fitting performance, and the highest training efficiency.
To compare the stability of the methods over multiple runs, the performance of CNN, ResNet, and TFDN in three repeated five-fold cross-validation tests is compared below, taking the combined dataset of all rotational speeds and all loads from sensor 3H as an example. The changes in the accuracy, precision, and recall indicators of each method are also presented. The results of the three methods are shown in Figure 7, Figure 8, and Figure 9, respectively. The blue shading in Figure 7a, Figure 8a, and Figure 9a represents the overall error range of each method across the three repeated tests.
Figure 7 shows that in the repeated tests of the CNN method, all indicators vary between 0.8 and 0.85, giving moderate accuracy. Both ResNet and TFDN achieve relatively high accuracy. The accuracy of ResNet fluctuates between 0.93 and 0.99 but is not very stable; in one of the folds its accuracy drops below 0.75. In the repeated tests, the accuracy of the TFDN method fluctuates around 0.99, with the lowest value near 0.984, making it the best performing of the three methods. As the boxplots show, the indicators of CNN and TFDN are relatively stable, while those of ResNet fluctuate significantly.
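A minimal sketch of this evaluation protocol is given below, assuming scikit-learn and a user-supplied train_and_predict hook (a placeholder for training any of the compared models): three repeats of stratified five-fold cross-validation, collecting accuracy, precision, and recall per fold.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score

def repeated_cv(X, y, train_and_predict, n_splits=5, n_repeats=3, seed=0):
    """Three repeats of stratified 5-fold CV; returns per-fold accuracy/precision/recall."""
    rskf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    scores = []
    for train_idx, test_idx in rskf.split(X, y):
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[test_idx])  # user-supplied hook
        scores.append((accuracy_score(y[test_idx], y_pred),
                       precision_score(y[test_idx], y_pred, average="macro"),
                       recall_score(y[test_idx], y_pred, average="macro")))
    return np.array(scores)                      # shape: (n_splits * n_repeats, 3)
```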
The feature extraction results of the networks are visualized using t-SNE, with the 3H dataset at 2000 r/min as an example; the features are taken from the layer before the final classification. Figure 10 shows that the proposed TFDN has a good clustering effect, with clear separation boundaries between the various classes. ResNet is next best, with confusion between only two fault classes (see Figure 11). Figure 12 shows that the clustering effect of CNN after feature extraction is poor, with confusion among multiple fault classes. To quantify the t-SNE results, we calculated three indices for each result: the Silhouette Score, the Calinski–Harabasz index, and the Davies–Bouldin index. For the Silhouette Score and the Calinski–Harabasz index, larger values are better; for the Davies–Bouldin index, smaller values are better. The results are shown in Table 7. The proposed method performs well on all three indices, and ResNet also achieves good results.
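A hedged sketch of this quantification step is shown below, assuming scikit-learn; the placeholder features array stands in for the activations taken before the classification layer.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

def tsne_cluster_quality(features: np.ndarray, labels: np.ndarray, seed: int = 0):
    """Embed pre-classification features with t-SNE and score the class separation."""
    embedded = TSNE(n_components=2, random_state=seed, init="pca").fit_transform(features)
    return {
        "silhouette": silhouette_score(embedded, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(embedded, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(embedded, labels),        # lower is better
    }

# Example with placeholder features: 12 classes, 200 samples each, 64-dimensional features.
features = np.random.randn(12 * 200, 64)
labels = np.repeat(np.arange(12), 200)
print(tsne_cluster_quality(features, labels))
```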
Data scarcity is a common issue in real-world scenarios; fault data in particular are typically scarce and valuable, so it is necessary to study the diagnostic capability of fault diagnosis models when training data are insufficient. Therefore, the accuracy of the methods under missing training data is investigated next, using the 3H 2000 r/min dataset as an example. The test ratio of the dataset is varied between 0.2 and 0.975. Each fault type originally contains 200 samples, so when the test ratio is 0.975, each category has only 5 training samples.
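The sweep can be organized as in the sketch below, a minimal outline assuming a stratified scikit-learn split and a placeholder train_and_score hook for training one model and returning its test accuracy.

```python
from sklearn.model_selection import train_test_split

def scarcity_sweep(X, y, train_and_score,
                   test_ratios=(0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.975)):
    """Shrink the training set by increasing the test ratio and record the accuracy."""
    results = {}
    for ratio in test_ratios:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=ratio, stratify=y, random_state=0)     # stratified so every class keeps samples
        results[ratio] = train_and_score(X_tr, y_tr, X_te, y_te)   # user-supplied training hook
    return results
# At test_size=0.975 with 200 samples per class, only 5 training samples per class remain.
```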
The diagnostic results are shown in Figure 13. As the test ratio increases, the amount of training data decreases, leading to a continuous decline in the diagnostic accuracy of all three algorithms. The accuracy of CNN drops rapidly as the test ratio increases; at a test ratio of 0.975, its accuracy is only 18.42%. For ResNet, the accuracy does not decrease significantly until the test ratio exceeds 0.6; at a test ratio of 0.975, its accuracy is 34.53%. In contrast, the accuracy of TFDN begins to drop rapidly only when the test ratio exceeds 0.9, and even with only five training samples per category, TFDN still achieves an accuracy of 70%. To verify the stability of TFDN under data scarcity, the proposed method was subjected to five repeated tests, and the variation ranges of the indicators were calculated. The results are shown in Figure 14, which indicates that across the repeated tests the proposed method achieved an average accuracy of 65.26% (minimum 60.36%, maximum 71.88%), an average precision of 70.48%, and an average recall of 65.46%. These results demonstrate that the proposed TFDN model maintains a certain level of diagnostic accuracy even when data are scarce, indicating that the algorithm has some capability to handle the risk of data deficiency.
5. Conclusions
The proposed TFDN framework integrates independent and complementary time-domain and frequency-domain feature extraction pathways, providing an approach to address key limitations of existing internal combustion engine (ICE) fault diagnosis methods. The time-domain pathway adopts residual blocks and a self-attention mechanism (SAM) to model the complex temporal dynamics and long-range dependencies inherent in reciprocating engine vibrations, while the frequency-domain CNN pathway extracts the spatial distribution patterns of spectral energy.
Validation experiments were conducted on a purpose-built inline six-cylinder diesel engine test bench, covering 12 fault conditions including abnormal injection pressure, abnormal fuel injection, abnormal injection advance angle, and abnormal valve clearance. The experimental results show that TFDN achieves high diagnostic accuracy and robustness, outperforming the comparison models, namely Random Forest (RF), Long Short-Term Memory (LSTM), standard CNN, ResNet18, and the time-domain-only component (TDEN) of TFDN. In particular, the fusion of frequency-domain features plays an important role in improving diagnostic stability, especially under variable operating conditions.
The scarcity of labeled fault data is one of the main challenges in practical fault diagnosis, and TFDN exhibits resilience in low-data scenarios. As Figure 13 illustrates, when the training data per fault class was drastically reduced to only 5 samples (test ratio = 0.975), TFDN maintained an accuracy of 70%, markedly higher than CNN (about 18%) and ResNet18 (about 35%) under the same conditions. This ability to learn from limited data is presumably related to the feature representation obtained by the dual-path architecture and to regularization mechanisms such as dropout and batch normalization in the design, giving it application potential in practical scenarios where fault data are difficult or costly to acquire.
This study addresses the issues of complex feature extraction and high data dependency in current deep learning-based ICE diagnosis. The TFDN model presents a feasible solution for ICE condition monitoring systems. Its ability to perform end-to-end learning directly from raw 1D signals eliminates the need for computationally expensive signal preprocessing or transformation (e.g., to images via STFT), enhancing its suitability for deployment. However, the proposed method is based on deep neural networks and is a black-box approach with poor interpretability, which limits its contribution to research on ICE fault mechanisms. Moreover, the attention mechanism significantly increases the hardware computational load, which is not conducive to online diagnosis. Many scholars have recently introduced physical information into deep learning models to improve their interpretability and stability [20,21], which is a promising direction for future research.
In addition, the data in this paper were obtained from simulated faults in the laboratory; real-world fault data contain greater noise, and the difficulty of diagnosis increases significantly. In that setting, transfer learning methods may be required. As reported in Refs. [22,23], a clear mapping method is needed to relate the source domain of laboratory fault data to the target domain of actual operating data.
Future work will focus on (a) further optimizing the network architecture for computational efficiency and real-time inference on embedded systems; (b) validating the model’s generalizability across a wider range of engine types, fault modes, and noise conditions; (c) exploring online learning strategies to adapt the model to engine aging and varying operational environments; and (d) investigating the fusion of vibration data with other sensor modalities (e.g., acoustic, thermal, pressure) within the TFDN framework for even more robust and comprehensive diagnostics.