1. Introduction
With increasing demands on aircraft engine performance and the continuous expansion of their operational envelopes, engine surge margins have become insufficient [
1]. Engine surge, as an unstable operational condition of aviation engines, is a low-frequency, high-amplitude oscillatory phenomenon of airflow along the compressor axis. It can induce severe mechanical vibrations in engine components and overheating of hot sections, leading to significant damage within a short period and seriously compromising flight safety [
2]. Therefore, accurate surge diagnosis has remained a key and challenging focus in the field of aero-engine health management.
The diagnosis of aero engine surge can be regarded as a specific type of fault detection and diagnosis problem. Many scholars have conducted extensive research on fault detection and diagnosis methods for mechanical systems, which primarily fall into two categories: knowledge-based reasoning methods and data-driven methods.
Knowledge-based reasoning methods do not require the establishment of an accurate system model. Instead, they perform computational reasoning and diagnosis based on systemic principles, long-term practical experience, and accumulated fault information, such as fault tree-based reasoning and expert systems. Knezevic et al. [
3] analyzed faults in the turbocharger of a diesel engine using Fault Tree Analysis (FTA), assessed system reliability, predicted fault causes, and ultimately achieved the goal of eliminating major faults in the air subsystem. Wang Hailan [
4] designed a fault tree-based expert system for natural gas engine fault diagnosis by analyzing the overall structure of such expert systems and their common fault modes, effectively improving diagnostic accuracy. However, as engineering systems have grown increasingly complex, establishing precise mathematical models has become very difficult, and it is equally challenging to encode expert knowledge and practical experience into reasoning rules; as a result, these methods often yield unsatisfactory outcomes on modern engine systems.
Data-driven methods eliminate the need for building precise complex system models or relying heavily on domain expert knowledge and knowledge representation and reasoning mechanisms. However, they typically require a large amount of accurate data. With the rapid development of artificial intelligence technology, deep learning has achieved remarkable results in many fields. Elashmawi et al. [
5] proposed a fault diagnosis model based on artificial neural networks, which offers the dual advantage of online monitoring, analysis, and diagnosis of gas turbine engine faults. Wu Bin et al. [
6] focused on turbofan engines and introduced Deep Belief Networks (DBN) for diagnosing performance degradation faults in engine components. This approach addressed the lack of generalization capability in shallow neural networks for diagnosis and improved the diagnostic accuracy for performance degradation faults in engine gas path components. Yuan et al. [
7] utilized Long Short-Term Memory Networks (LSTM) for fault diagnosis and remaining useful life prediction of aero engines. Chen et al. [
8] proposed a Hybrid Dilated Convolution (HDC) model based on CNN, Deep Neural Networks (DNN), and LSTM. This model achieved an accuracy of 83% in diagnosing gas path faults in aero engines. Guo et al. [
9] developed a real-time accurate bearing fault diagnosis method using wavelet transforms and deformable CNNs. Yang et al. [
10] applied autoencoders and CNNs to process bearing vibration signals, achieving effective fault diagnosis.
Compared with general rotating machinery, aero engines operate under highly variable conditions with extremely complex mission profiles. Additionally, due to the specific constraints of their working environment, there are strict limitations on sensor placement and weight in the aero engine, resulting in a limited number of fixed measurement points. The test signals received by sensors often undergo multipath propagation and are subject to interference from vibrations, aerodynamics, combustion, and other factors, leading to low signal-to-noise ratios and posing difficulties for surge diagnosis. Li et al. [
11] installed pressure sensors inside and at the outlet of a seven-stage compressor pipeline to measure static pressure inside the pipeline and total pressure at the outlet, respectively. By applying short-time Fourier transform to time–frequency analysis of the acquired signals, they successfully identified characteristic features before surge occurrence. Zheng et al. [
12] installed pressure and temperature sensors at multiple flow-direction positions in the pipeline to study the initiation and evolution of rotating instability. Pullan et al. [
13] identified spike-like features during stall and explained the physical mechanism behind their emergence. Munari et al. [
14] mounted vibration sensors on the casing and identified characteristic surge frequencies through sideband analysis of synchronous resonance frequencies.
However, most experimental studies on compressor surge currently rely on conventional pressure, temperature, and vibration sensors. Pressure and temperature sensor probes are installed inside the pipeline as intrusive sensors, introducing external interference into the compressor flow field [
15]. Vibration sensors, mounted on the casing, cannot effectively measure the airflow field. Moreover, the precursor features measured by these methods are relatively weak; for aero engine compressors operating under real conditions, these features are often obscured, making early warning challenging [
16]. Li Zepeng et al. [
17] employed a non-intrusive circumferential microphone array to conduct experimental research on a real aero engine fan test rig, achieving pre-surge feature identification through decomposition of pipeline noise modal waves. Jianpeng Ma et al. [
18] proposed a method based on uniform phase intrinsic time-scale decomposition (UPITD); by analyzing the time-domain correlation between weak magnetic signals and the cage rotation frequency, the method effectively separates fault signals from cage rotational signals. Yun Li et al. [
19] proposed an improved EMD method, called FAEMD, and applied it to bearing fault diagnosis; analysis of two groups of measured rolling-bearing fault signals showed that FAEMD adapts well to nonstationary signals. Xiaolin Liu et al. [
20] presented a multiscale fusion attention CNN (MSFACNN) to diagnose faults in aero engine rolling bearings. Yulai Zhao et al. [
21] combined an improved TSA with a data-driven strategy for fault diagnosis of bearing-rotor systems. Yao Yanling et al. [
22] proposed a surge diagnosis model for aero engines based on CNN-Seq2Seq.
Current research primarily focuses on surge occurring under steady-speed conditions, whereas surge is more likely to occur during acceleration, and such dynamic conditions readily lead to misjudgment. Moreover, most studies rely on signals with distinct features, so surge that presents in other forms or with weak features is susceptible to missed detection. In next-generation engines, excessively high outlet temperatures and the difficulty of measuring dynamic pressures pose additional challenges.
To overcome the obstacles identified previously, the research presented in this paper encompasses the following subjects:
1. This study proposes an advanced model for predicting surge faults in aircraft engines with high accuracy. The core concept involves capturing subtle fault precursors by fusing spatiotemporal features of signals. The model directly yields diagnostic results from raw signals, reducing the reliance on and errors associated with manual feature extraction in traditional methods.
2. Addressing the challenges of difficult acquisition and high experimental costs for engine surge data, as well as the small-sample dilemma, the number of samples was expanded using a sliding window slicing technique. This approach ensures both the informational continuity between overlapping slices and the stationarity of feature statistics. Consequently, the engine surge diagnosis model was trained using a limited number of samples.
3. Spatial feature extraction employs CNN to analyze FFT of dynamic signals, capturing patterns across different frequency components. Temporal feature extraction utilizes BiLSTM to analyze time-frequency domain characteristics (VMD) of dynamic signals, capturing dependencies and evolutionary patterns along the temporal dimension. Spatiotemporal feature fusion adopts a cross-attention mechanism, enabling deep interaction between spatial and temporal features. This allows for more precise identification of easily overlooked weak fault characteristics, thereby enhancing the accuracy of surge diagnosis.
4. Through comparative analysis of experimental results, this study critically examines the findings, which indicate that the model achieved an F1-score, Recall, Precision, and Accuracy of 97.96%, 97.52%, 98.43%, and 99.01%, respectively, for surge fault classification. These results demonstrate that the model can meet the practical requirements for engine surge diagnosis.
3. Methodology
The proposed STFF-CANet diagnosis model for aero engine surge, based on spatiotemporal feature fusion, is illustrated in
Figure 4. The overall workflow of the model comprises three main stages: data preprocessing, feature extraction, and spatiotemporal feature fusion. It aims to extract and fuse effective features from surge fault data to achieve enhanced performance in fault identification or analysis.
3.1. Data Pre-Processing
To address the challenges of difficult data acquisition and high experimental costs for aero engine surge data, as well as the problem of small-sample data, this study aims to train an efficient engine surge diagnosis model using a limited number of samples. We selected a portion of historical test data from a certain type of aero engine as the dataset. This test data includes dynamic pressure data from various scenarios such as start-up, acceleration, deceleration, thrust increase, and distortion.
The surge fault timestamps were manually annotated by experts based on corresponding rules and experience. For example, when the numerical curve from the speed sensor stabilizes, if the pressure sensor exhibits a sudden rise, drop, or severe fluctuation until it returns to stability, this interval is marked as the surge interval. Every moment within this interval is considered a surge occurrence, while all other moments are labeled as normal operation. After manual annotation, the ratio of normal points to surge points in the dataset is approximately 90:10.
The original dataset was shuffled and then split into training, validation, and test sets in a 7:2:1 ratio. This ratio is a widely adopted practice in machine learning for limited datasets, ensuring a sufficiently large training set for model learning while reserving adequate samples for unbiased validation and final testing. In our case, the 70% training set allows the model to learn the underlying patterns, the 20% validation set is used for hyperparameter tuning and preventing overfitting, and the 10% test set provides a final evaluation on unseen data, simulating real-world application scenarios. The model was trained on the training set and evaluated on the validation set. Once the optimal parameters were found, the final evaluation was conducted on the test set. Let T = {t1, ⋯, tn} represent the training set with a total of n timestamps; ti = {si1, ⋯, sim} represents the test data at the i-th timestamp; sij is the value from the j-th pressure sensor at the i-th timestamp, with a total of m pressure sensors. R = {0,1} denotes the diagnosis result, where 0 indicates normal operation and 1 indicates a surge fault.
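The shuffle-and-split procedure described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; the function name and seed are ours:

```python
import numpy as np

def split_dataset(samples, labels, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle and split a dataset into train/validation/test sets (7:2:1)."""
    n = len(samples)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)          # random shuffle of sample indices
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]      # remainder (~10%) is the test set
    return ((samples[train], labels[train]),
            (samples[val], labels[val]),
            (samples[test], labels[test]))
```

Fixing the random seed makes the split reproducible across training runs, so the validation and test sets stay disjoint from the training data.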
To increase the number of samples while ensuring both informational continuity between overlapping slices and the statistical stationarity of features, we employed a sliding window slicing method to process the time-series data and construct the model’s input, as illustrated in
Figure 5. Specifically, data collected from each sensor is segmented using a fixed-size window with a fixed sliding stride. This approach augments the number of samples in the dataset, providing more data for model training. Particularly when experimental data collection is limited, a smaller sliding stride can be used to obtain more samples. Furthermore, the overlapping sub-windows generated during sliding window sampling help the model more easily learn the characteristics of fault sequences. The size of the sliding window corresponds to the time-step length of a single training sample.
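The fixed-size window with a fixed stride can be sketched in a few lines of NumPy. This is an illustrative helper (the name and parameters are ours, not from the paper); a stride smaller than the window length produces the overlapping slices described above:

```python
import numpy as np

def sliding_window_slices(signal, window, stride):
    """Segment a 1-D sensor signal into overlapping fixed-size windows.

    Each row of the result is one training sample of length `window`;
    consecutive rows overlap by (window - stride) points.
    """
    n_windows = (len(signal) - window) // stride + 1
    return np.stack([signal[i * stride : i * stride + window]
                     for i in range(n_windows)])
```

For example, a 10-point signal with `window=4` and `stride=2` yields four samples, each sharing two points with its neighbor, quadrupling the sample count relative to non-overlapping slicing.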
The data augmentation through window slicing effectively mitigates overfitting in small-sample data classification, enhancing the model’s robustness and generalization capability. Once the data enters the model, preprocessing is first performed on the surge fault data to extract features from both frequency-domain and time-domain perspectives. Frequency-domain features are extracted by applying the FFT to the original surge fault data, converting the time-domain signal into a frequency-domain signal. FFT reveals the frequency components of the signal, aiding in the analysis of its distribution across different frequencies. The resulting frequency-domain features are represented as a series of spectrograms. Time-domain features are obtained by processing the surge fault data using the VMD method. VMD decomposes complex signals into multiple IMFs with finite bandwidth and specific center frequencies. These IMF components represent the time-domain characteristics of the signal across different frequency bands, thereby capturing local variations and detailed information of the signal along the temporal dimension.
Figure 5.
Sliding Window Slicing Data Processing Method.
3.2. Feature Extraction
Following data preprocessing, the model employs a CNN and a BiLSTM to perform further extraction of frequency-domain and time-domain features, respectively:
- Spatial Feature Extraction (CNN): The frequency-domain features obtained via the FFT are fed into the CNN, which processes them through convolutional and pooling operations. The convolutional layer slides multiple kernels over the frequency-domain feature maps to extract local patterns and spatial features under different frequency components. The pooling layer downsamples the convolved feature maps, reducing computational complexity while enhancing translation invariance. Ultimately, the CNN outputs the spatial feature representation of the frequency domain, denoted as $F_{FFT}$.
- Temporal Feature Extraction (BiLSTM): The time-domain features derived from VMD decomposition are input into the BiLSTM, which consists of two oppositely directed LSTM layers and can therefore exploit both past and future contextual information. This architecture fully captures the long-term dependencies and evolutionary patterns of the signal along the temporal dimension, yielding a more representative time-domain feature representation, denoted as $F_{VMD}$.
3.2.1. Spatial Feature Extraction
The CNN is capable of extracting spatial features from signals. We first apply the FFT to convert the dynamic pressure signal from the time domain to the frequency domain, obtaining a frequency-domain feature matrix. This matrix is then used as input to the CNN, where spatial features within the frequency domain are extracted through convolutional and pooling operations. This approach combines the strengths of FFT in frequency-domain analysis with the powerful feature extraction capabilities of CNNs, enabling effective capture of complex patterns and characteristics present in the dynamic pressure signal.
The Fast Fourier Transform is applied to the acquired dynamic pressure signal $x(t)$ to convert it from the time domain to the frequency domain. The mathematical expression for the Fourier transform is:

$$X(f) = \mathcal{F}\{x(t)\} = \int_{-\infty}^{+\infty} x(t)\, e^{-j 2\pi f t}\, dt$$

where $\mathcal{F}\{\cdot\}$ denotes the Fourier transform operation, and $X(f)$ is the frequency-domain representation of the signal $x(t)$ at frequency $f$. Through the FFT, we obtain the signal’s frequency spectrum, which contains amplitude and phase information for different frequency components.
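As a concrete illustration (with an assumed 1 kHz sampling rate and a synthetic two-tone signal standing in for the dynamic pressure data), the one-sided spectrum and its dominant frequency can be obtained with NumPy's FFT routines:

```python
import numpy as np

fs = 1000.0                               # assumed sampling rate, Hz
t = np.arange(0, 1.0, 1.0 / fs)
# Synthetic stand-in for a dynamic pressure signal: 50 Hz + weaker 120 Hz tone
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

X = np.fft.rfft(x)                        # one-sided spectrum of the real signal
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
amps = np.abs(X) * 2.0 / len(x)           # amplitude of each frequency component
peak_freq = freqs[np.argmax(amps)]        # dominant component (here, 50 Hz)
```

The amplitude array `amps` is what gets arranged into the frequency-domain feature matrix fed to the CNN; the phase information is available via `np.angle(X)`.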
The result from the FFT is organized into a format suitable for CNN input. Typically, the spectral data can be treated as a two-dimensional image, where the horizontal axis represents frequency and the vertical axis represents time (in the case of the Short-Time Fourier Transform) or another dimension, or simply as a one-dimensional vector. Suppose we arrange the spectral data into a two-dimensional matrix $X_F$, where element $X_F(i, j)$ represents the frequency-domain value at a specific frequency and time point.
The organized frequency-domain feature matrix is then used as input to the CNN. The CNN extracts spatial features (patterns within the frequency domain) through convolutional and pooling operations. The specific steps are as follows:
Convolution Operation: The convolutional layer employs multiple convolution kernels to perform convolution operations on the input frequency-domain feature matrix, capturing local patterns under different frequency components. The mathematical expression for the convolution operation is:

$$H^{(l)} = \sigma\left(W^{(l)} * H^{(l-1)} + b^{(l)}\right)$$

where $H^{(l)}$ is the output feature map after the $l$-th convolutional layer, $\sigma$ is the activation function, $W^{(l)}$ is the weight matrix of the convolution kernel in the $l$-th layer, $*$ denotes the convolution operation, and $b^{(l)}$ is the bias vector for the $l$-th layer.
Pooling Operation: The pooling layer (typically max pooling or average pooling) downsamples the convolved feature maps. This reduces the dimensionality of the feature maps, improves computational efficiency, and enhances the translation invariance of the features. The mathematical expression for max pooling is:

$$P^{(l)}_{i,j} = \max_{(p,q) \in R_{i,j}} H^{(l)}_{p,q}$$

where $P^{(l)}$ is the output feature map after the $l$-th pooling layer, and $R_{i,j}$ is the region covered by the pooling window.
Through multiple layers of convolutional and pooling operations, the CNN can progressively extract high-level spatial features from the frequency-domain characteristics. These features can more effectively represent the patterns and structures of the dynamic pressure signal across different frequency components.
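The convolution and pooling steps above can be made concrete with a minimal NumPy sketch. This is an illustration of the operations only (single channel, "valid" padding, ReLU activation), not the actual network configuration used in STFF-CANet:

```python
import numpy as np

def conv2d(x, kernel, bias=0.0):
    """'Valid' 2-D convolution of one feature map with one kernel, then ReLU."""
    kh, kw = kernel.shape
    h = x.shape[0] - kh + 1
    w = x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # Local pattern response at position (i, j)
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * kernel) + bias
    return np.maximum(out, 0.0)   # ReLU activation sigma

def max_pool2d(x, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))
```

An 8x8 frequency-domain patch convolved with a 3x3 kernel yields a 6x6 feature map, which 2x2 max pooling reduces to 3x3, halving each spatial dimension while retaining the strongest local responses.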
3.2.2. Temporal Feature Extraction
To extract more representative features from the dynamic pressure signal, a method combining VMD and BiLSTM can be employed. The detailed description is as follows:
First, the acquired dynamic pressure signal is decomposed using VMD, as described in 2.2. The individual IMF components obtained from VMD are then used as input for subsequent analysis. Each IMF component can be regarded as a time-frequency domain feature sequence. To facilitate processing by the BiLSTM, these IMF components are organized and concatenated to form a comprehensive time-frequency domain feature matrix, denoted as $F_{VMD}$. Assuming there are $K$ IMF components, each containing data points at $N$ time instances, $F_{VMD}$ can be represented as a $K \times N$ matrix, where element $F_{VMD}(i, j)$ represents the feature value of the $i$-th IMF component at the $j$-th time point.
The organized time-frequency domain feature matrix is then input into the BiLSTM. The BiLSTM consists of two oppositely directed LSTM layers, enabling it to simultaneously utilize both past and future context information. This allows for a more effective capture of the dependencies and evolutionary patterns of the signal along the temporal dimension. The BiLSTM integrates the forward and backward processing of the input data, thereby extending the standard LSTM architecture. In this setup, the input sequence is processed by two distinct LSTM networks. One network reads the sequence from start to end (the forward LSTM), while the other reads it from end to start (the backward LSTM). This bidirectional processing allows the BiLSTM to capture dependencies that might not be evident through unidirectional analysis alone, resulting in a more accurate representation of the input data.
The LSTM is capable of capturing long-term dependencies within time series. The core of an LSTM unit consists of a cell state and three gating mechanisms (the input gate, the forget gate, and the output gate). Taking a single LSTM unit as an example, its forward propagation process can be described by the following formulas:

Forget Gate: Determines which information to discard from the cell state.

$$f_t = \sigma\left(W_f \left[h_{t-1}, x_t\right] + b_f\right)$$

Input Gate: Determines how much new information should be incorporated into the cell state.

$$i_t = \sigma\left(W_i \left[h_{t-1}, x_t\right] + b_i\right)$$
$$\tilde{C}_t = \tanh\left(W_C \left[h_{t-1}, x_t\right] + b_C\right)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Output Gate: Determines what value is to be output.

$$o_t = \sigma\left(W_o \left[h_{t-1}, x_t\right] + b_o\right)$$
$$h_t = o_t \odot \tanh\left(C_t\right)$$

where $x_t$ is the input at time $t$, $h_{t-1}$ is the hidden state at time $t-1$, and $C_{t-1}$ is the cell state at time $t-1$; $W_f$, $W_i$, $W_C$, $W_o$ are weight matrices; $b_f$, $b_i$, $b_C$, $b_o$ are bias vectors; $\sigma$ is the sigmoid activation function; and $\tanh$ is the hyperbolic tangent activation function.
The BiLSTM obtains the final output features by concatenating the hidden states of the forward LSTM and the backward LSTM. Assume the hidden state of the forward LSTM at time $t$ is $\overrightarrow{h}_t$, and the hidden state of the backward LSTM at time $t$ is $\overleftarrow{h}_t$; then, the output of the BiLSTM at time $t$ is:

$$h_t = \left[\overrightarrow{h}_t;\ \overleftarrow{h}_t\right]$$
Through the processing of BiLSTM, the long-term dependencies and evolutionary patterns of dynamic pressure signals along the temporal dimension can be fully explored, enabling the extraction of more representative temporal features.
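The gate equations and the bidirectional concatenation can be sketched directly in NumPy. This is a pedagogical single-layer illustration with a fused gate weight matrix (an assumption of ours, common in practice), not the trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_prev; x_t] to the four stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.size
    f = sigmoid(z[:H])            # forget gate f_t
    i = sigmoid(z[H:2*H])         # input gate i_t
    g = np.tanh(z[2*H:3*H])       # candidate cell state C~_t
    o = sigmoid(z[3*H:])          # output gate o_t
    c = f * c_prev + i * g        # new cell state C_t
    h = o * np.tanh(c)            # new hidden state h_t
    return h, c

def bilstm_outputs(xs, H, W, b):
    """Concatenate forward and backward hidden states at each time step."""
    def run(seq):
        h, c, hs = np.zeros(H), np.zeros(H), []
        for x_t in seq:
            h, c = lstm_step(x_t, h, c, W, b)
            hs.append(h)
        return hs
    fwd = run(xs)                 # reads the sequence start-to-end
    bwd = run(xs[::-1])[::-1]     # reads it end-to-start, then re-aligns
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]
```

Note that each BiLSTM output has dimension 2H, since the forward and backward hidden states are concatenated; a production model would use separate forward and backward weights.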
In summary, we first use VMD to decompose the dynamic pressure signal into multiple Intrinsic Mode Functions, obtaining the time-frequency domain feature matrix $F_{VMD}$. This matrix is then input into the BiLSTM, which captures the dependencies and evolutionary patterns of the signal along the temporal dimension through its bidirectional processing mechanism. This method combines the strengths of VMD in signal decomposition with the powerful capability of BiLSTM in extracting temporal sequence features, allowing for the effective extraction of valuable feature information from dynamic pressure signals.
3.3. Spatio-Temporal Feature Fusion
After separately extracting the spatial features from the frequency domain and the temporal features from the time domain, the model employs a cross-attention mechanism to fuse these two types of features. The spatiotemporal feature fusion utilizes a cross-attention mechanism, enabling deep interaction between spatial and temporal features. This allows for more precise identification of weak fault characteristics that are easily overlooked, thereby enhancing the accuracy of surge diagnosis.
3.3.1. Feature Input
Assume that after preprocessing, the spatial features from the frequency domain are represented as $F_{FFT}$ with dimensions $d_f \times N_f$ (where $d_f$ is the feature dimensionality and $N_f$ is the number of features), and the temporal features are represented as $F_{VMD}$ with dimensions $d_v \times N_v$ ($d_v$ is the temporal feature dimensionality and $N_v$ is the number of temporal features). To perform the cross-attention computation, these two feature sets must be appropriately transformed and aligned.
3.3.2. Cross-Attention Computation
The core of the cross-attention mechanism is to compute the similarity between the Query (Q), Key (K), and Value (V) to determine the degree of association between different features. Here, we can use the spatial feature $F_{FFT}$ as the Query (Q) and the temporal feature $F_{VMD}$ as the Key (K) and Value (V), or vice versa. The following description uses the former as an example.
First, linear transformations are applied to the Query, Key, and Value to map them into the same dimensional space for computing attention weights. Let the transformed Query matrix be $Q$, the Key matrix be $K$, and the Value matrix be $V$. The transformation process can be represented as:

$$Q = W_Q F_{FFT}, \quad K = W_K F_{VMD}, \quad V = W_V F_{VMD}$$

where $W_Q$, $W_K$, and $W_V$ are learnable weight matrices, whose dimensions are determined based on the input features and the target dimension.
Next, compute the similarity between the Query and the Key to obtain the attention scores. A commonly used method for calculating similarity is the dot product similarity, formulated as follows:

$$S = Q^{\top} K$$
To ensure better numerical stability of the attention scores, they are typically scaled. The scaling factor is $\sqrt{d_k}$ (where $d_k$ is the dimensionality of the key vectors), resulting in the scaled attention scores:

$$S' = \frac{Q^{\top} K}{\sqrt{d_k}}$$
Subsequently, a softmax function is applied to the scaled attention scores to obtain the attention weights, ensuring that the weights sum to 1:

$$A = \mathrm{softmax}\left(S'\right)$$
Finally, a weighted sum of the Value matrix is performed according to the attention weights to obtain the fused feature representation $F$:

$$F = V A^{\top}$$
The learnable weight matrices $W_Q$, $W_K$, and $W_V$ are designed to project the input features $F_{FFT}$ (dimensions $d_f \times N_f$) and $F_{VMD}$ (dimensions $d_v \times N_v$) into a shared latent space of dimension $d_k$, ensuring compatibility for the attention calculation: $Q \in \mathbb{R}^{d_k \times N_f}$, $K, V \in \mathbb{R}^{d_k \times N_v}$. The attention weights obtained from the softmax have a clear physical meaning: they represent the adaptive correlation strength between each frequency-domain spatial feature (in Q) and each time-domain temporal feature (in K). For instance, in our case studies, high attention weights were consistently assigned to interactions between low-frequency spectral bands from the FFT features (indicative of surge, as per 2.1) and high-energy transient peaks within specific VMD-IMF components. This demonstrates the mechanism’s ability to automatically focus on and fuse the most salient and mutually reinforcing spatiotemporal signatures of a surge event.
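The full Q/K/V projection, scaling, softmax, and weighted sum can be sketched in NumPy. For readability this sketch uses a row-major layout (features as rows, i.e., the transpose of the paper's $d \times N$ convention), and the weight matrices are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(F_fft, F_vmd, W_q, W_k, W_v):
    """Spatial features attend over temporal features (Q from FFT, K/V from VMD).

    F_fft: (Nf, df) spatial features; F_vmd: (Nv, dv) temporal features.
    Returns the fused features (Nf, dk) and the attention weight matrix (Nf, Nv).
    """
    Q = F_fft @ W_q                         # project into shared dk-dim space
    K = F_vmd @ W_k
    V = F_vmd @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # scaled dot-product similarity
    A = softmax(scores, axis=-1)            # each row sums to 1
    return A @ V, A                         # weighted sum of values, and weights
```

Each row of `A` shows how strongly one frequency-domain feature attends to each temporal feature, which is the adaptive correlation described above.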
3.3.3. Feature Fusion and Diagnosis
The fused feature $F$ incorporates deep interactive information between the spatial features from the frequency domain and the temporal features, providing a more comprehensive representation of surge fault characteristics. This fused feature $F$ is then fed into a fully connected layer (FC) for further feature transformation and integration. The fully connected layer can be expressed as:

$$y = \sigma\left(W F + b\right)$$

where $W$ is the weight matrix of the fully connected layer, $b$ is the bias vector, and $\sigma$ is an activation function.
Finally, a Softmax layer (SM) maps the output of the fully connected layer to a probability space, yielding the final classification or fault diagnosis result, which is used to determine the type or state of the surge fault. The formula for the Softmax function is:

$$P_i = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}}$$

where $z_i$ is the $i$-th element of the output from the fully connected layer, $C$ is the number of classes, and $P_i$ represents the probability that the sample belongs to the $i$-th class.
Through the cross-attention mechanism, the model can adaptively focus on the correlations between frequency-domain spatial features and temporal-domain features, facilitating deep interaction between the two. This approach helps capture weak fault characteristics that are easily overlooked when relying on a single feature modality, thereby enhancing the accuracy and reliability of surge fault diagnosis.
In summary, the proposed model combines FFT and VMD for data preprocessing, utilizes CNN and BiLSTM to extract frequency-domain and time-domain features respectively, and realizes spatiotemporal feature fusion via a cross-attention mechanism. This integrated approach leverages the respective strengths of different methods in signal processing and feature extraction, enabling a more comprehensive and accurate analysis of surge fault data.
4. Results and Discussion
In this section, based on the experimental data from aero engine tests and the STFF-CANet diagnosis model, the evaluation metrics and surge diagnosis results of the model are presented.
4.1. Experimental Data
The dataset consists of approximately one million dynamic pressure data points obtained from a 1000 kg-class turbofan engine, covering multiple test scenarios including start-up, acceleration, deceleration, thrust increase, and inlet distortion. All experimental data were obtained under controlled ground test-stand conditions, which are the standard and necessary precursor for validating surge characteristics before flight testing. These data are used as the training, validation, and test sets for the model. The aforementioned STFF-CANet diagnostic model is employed to diagnose engine surge faults.
4.2. Evaluating Indicator
To evaluate the performance of the classification model, this paper employs the metrics of Precision, Recall, F1-score, and Accuracy. These four indicators are calculated to assess the overall capability of the fault detection model.
To distinguish the performance of the diagnostic model, we utilize four fundamental definitions: True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN). Their specific definitions are as follows:
TP is the total number of samples where the predicted label is fault and the actual label is also fault.
FN is the total number of samples where the predicted label is normal but the actual label is fault.
FP is the total number of samples where the predicted label is fault but the actual label is normal.
TN is the total number of samples where the predicted label is normal and the actual label is also normal.
The four metrics are computed as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

In the formulas: TP denotes the number of instances correctly predicted as the positive class, i.e., the count of correctly identified surge faults; FP denotes the number of instances incorrectly predicted as the positive class, i.e., the count of normal states misclassified as faults; FN denotes the number of instances incorrectly predicted as the negative class, i.e., the count of surge faults misclassified as normal. Precision represents the model’s ability to accurately detect faults; Recall represents the model’s ability to identify all surge fault samples within the dataset; and the F1-score indicates the comprehensive performance of the model.
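These standard definitions translate directly into code; the helper below (our naming) computes all four metrics from the confusion-matrix counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, Recall, F1-score and Accuracy from confusion-matrix counts.

    The positive class is 'surge fault'; assumes tp+fp > 0 and tp+fn > 0.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```

With the roughly 90:10 normal-to-surge class imbalance noted in 3.1, accuracy alone can look high even for a weak detector, which is why Precision, Recall, and F1-score on the surge class are reported alongside it.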
4.3. Surge Diagnosis Results
To validate the diagnostic effectiveness of the designed STFF-CANet model, this study compares it with other baseline models using the same dataset. Four evaluation metrics—macro-averaged F1-score, Recall, Precision, and overall Accuracy—are employed to assess the performance of the five models. To minimize randomness, each diagnostic model experiment was trained for 50 epochs and tested 20 times on the test set. The final evaluation metrics are derived from the average of these 20 test runs. The fault diagnosis results of different methods on the dataset are presented in
Table 1.
As can be seen from
Table 1, compared to the CNN, LSTM, BiLSTM, and CNN-BiLSTM diagnostic models, the STFF-CANet model proposed in this paper achieves the best performance, with its F1-score, Recall, Precision, and Accuracy for surge fault classification reaching 97.96%, 97.52%, 98.43%, and 99.01%, respectively. Furthermore, the experimental results validate that the dataset generation and preprocessing method proposed in this paper is effective for surge fault diagnosis, as all diagnostic models listed in
Table 1 attain a precision above 85%.
As described in 4.1, the dataset used for model training encompassed data from the engine across multiple test scenarios, including start-up, acceleration, deceleration, thrust increase, and inlet distortion.
Table 1 presents the evaluation metrics of the STFF-CANet model across all scenarios, demonstrating its robustness to varying operating modes and flight conditions. The model’s effectiveness stems from its training on a composite dataset incorporating these varied modes, enabling it to learn surge signatures that are invariant to the underlying steady-state operating point but are triggered by specific transient aerodynamic events.
To effectively identify aero engine surge faults, a deep learning-based diagnosis model named STFF-CANet is proposed, together with a sliding window-based dataset construction method. First, surge feature analysis is performed on aero engine sensor data in both the time and frequency domains, and the sliding window-based data preprocessing algorithm is applied to construct the dataset and label set. Then, by integrating a convolutional neural network and a long short-term memory network, a deep neural network model tailored for aero engine surge fault diagnosis is designed. Finally, the proposed model is compared with state-of-the-art deep neural networks on the constructed dataset. Experimental results demonstrate that the proposed model achieves an F1-score, Recall, Precision, and Accuracy of 97.96%, 97.52%, 98.43%, and 99.01%, respectively, for surge fault classification, outperforming the other network models.
4.4. Ablation Study
To validate the necessity and individual contribution of the core modules in the STFF-CANet framework, an ablation study was conducted. We designed two model variants and compared their performance with the full STFF-CANet model on the same test set. The results of the ablation study are summarized in
Table 2.
Variant A (No Sliding Window): The model was trained and tested on the original, non-overlapping data segments (i.e., without the sliding window augmentation described in Section 3.1). This directly reduces the number of training samples.
Variant B (No Cross-Attention): The spatiotemporal features extracted by the CNN and BiLSTM were fused using simple concatenation instead of the proposed cross-attention mechanism.
Table 2.
Results of the ablation study.
| Model Variant | F1 Score/% | Recall/% | Precision/% | Accuracy/% |
|---|---|---|---|---|
| STFF-CANet | 97.96 | 97.52 | 98.43 | 99.01 |
| Variant A (No Sliding Window) | 91.01 | 90.87 | 89.20 | 87.55 |
| Variant B (No Cross-Attention) | 90.88 | 89.40 | 87.38 | 88.52 |
The performance drop in Variant A confirms the effectiveness of the sliding window method in augmenting limited data and providing richer sequential context, which is crucial for learning transient surge characteristics. The inferior results of Variant B compared to the full model demonstrate that the cross-attention mechanism provides a superior fusion strategy over simple concatenation, enabling adaptive, deep interaction between spatiotemporal features. This ablation study quantitatively validates the necessity and individual contribution of each proposed core module to the overall model performance.
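The cross-attention fusion that distinguishes the full model from Variant B can be illustrated with a minimal NumPy sketch. This is a deliberate simplification of the paper's module: the learned query/key/value projections are omitted (identity projections are assumed), the spatial CNN features act as queries, and the temporal BiLSTM features supply keys and values:

```python
import numpy as np

def cross_attention_fuse(spatial, temporal):
    """Fuse CNN (spatial) and BiLSTM (temporal) features via scaled
    dot-product cross-attention, then concatenate the attended temporal
    context onto the spatial features. Shapes are illustrative."""
    d = temporal.shape[-1]
    scores = spatial @ temporal.T / np.sqrt(d)           # (Ns, Nt) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    attended = weights @ temporal                        # (Ns, d) temporal context
    return np.concatenate([spatial, attended], axis=-1)  # fused representation

rng = np.random.default_rng(0)
spatial = rng.standard_normal((6, 16))    # 6 spatial feature tokens
temporal = rng.standard_normal((10, 16))  # 10 temporal feature tokens
fused = cross_attention_fuse(spatial, temporal)
```

Unlike the simple concatenation of Variant B, each fused token here carries a data-dependent weighted summary of the whole temporal sequence, which is the adaptive interaction the ablation result attributes the performance gap to.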
4.5. Discussion on Model Feasibility and Practical Deployment
The practical deployment of the STFF-CANet model in real-world aero-engine health monitoring systems necessitates a rigorous assessment of its real-time performance and adaptation to embedded hardware constraints. To preliminarily assess its real-time potential, inference tests were performed on a high-performance multi-core CPU. The average inference time for processing a single data segment (with the sliding window length defined in Section 3.1) was approximately 15 milliseconds. This latency is primarily attributed to the sequential operations of the BiLSTM layers and the attention mechanism. While this result is promising, deployment and testing on embedded aviation systems are constrained by the current research conditions and are planned as the next step; further lightweighting of the model may prove necessary, which is itself a worthwhile direction for future research. Moreover, while the current model demonstrates high diagnostic accuracy on the available test data, its sustained accuracy over an engine’s lifetime necessitates further investigation. This will be a central focus of our subsequent research as we aim to collect more comprehensive lifecycle data and develop robust online adaptation methodologies.
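A per-segment latency of the kind reported above is typically obtained by averaging many timed forward passes after a warm-up phase. A self-contained timing harness sketch (the placeholder computation stands in for the trained model's forward pass, which is not reproduced here):

```python
import time

def measure_latency_ms(model_fn, sample, runs=100, warmup=10):
    """Average single-segment inference latency in milliseconds.

    model_fn stands in for the trained STFF-CANet forward pass; warm-up
    iterations run first so caches and lazy initialisation do not skew
    the timed loop.
    """
    for _ in range(warmup):
        model_fn(sample)
    start = time.perf_counter()
    for _ in range(runs):
        model_fn(sample)
    return (time.perf_counter() - start) / runs * 1e3

# placeholder "model": a cheap reduction over one data segment
dummy_segment = list(range(1024))
latency_ms = measure_latency_ms(lambda s: sum(s), dummy_segment)
```

On embedded targets the same harness would be run on-device, since CPU-measured figures such as the 15 ms reported here do not transfer directly to avionics-grade hardware.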
The baseline models selected for comparison (CNN, LSTM, BiLSTM, CNN-BiLSTM) represent standard deep learning approaches for processing temporal and spatial features in fault diagnosis. The STFF-CANet model, with its targeted cross-attention fusion of complementary spatiotemporal features derived from FFT and optimized VMD, offers a parsimonious and effective solution tailored for the specific challenge of extracting weak surge precursors from limited, high-dimensional sensor data. Future work will focus on collecting more extensive datasets to enable fair and robust comparisons with more advanced architectures and on exploring hybrid models that may incorporate their strengths.