1. Introduction
Permanent magnet synchronous motors (PMSMs) are widely used in industrial and transportation systems, particularly in applications such as electric vehicles, wind energy conversion, and precision manufacturing. Their widespread adoption is largely driven by advantages such as high power density, energy efficiency, and compact design [
1,
2]. In inverter-driven systems, PMSMs function alongside power electronic converters that adjust the supply voltage through high-frequency switching. This enables flexible speed control and enhances overall system performance. However, the same integration also increases system complexity and brings a wider range of potential fault conditions. These include motor-related faults, such as inter-turn short-circuit (ITSC) and demagnetization, as well as inverter-related faults, including open-circuit and short-circuit switch failures and overheating conditions [
3,
4]. Because these electromechanical systems are tightly coupled, a fault in a single component may spread across the system within a short time. A failure of this kind may lead to unplanned downtime. It can also increase maintenance costs and raise safety concerns, especially in critical applications [
5,
6].
Reliable fault detection and diagnosis (FDD) plays an important role in condition-based maintenance of modern electric drive systems. Common signal-processing approaches such as motor current signature analysis (MCSA), Fast Fourier Transform (FFT), and wavelet decomposition are typically used to identify periodic fault patterns, particularly under stable operating conditions [
7,
8]. In inverter-driven PMSMs, these methods tend to lose effectiveness. Current signals are often non-stationary, operating speed varies, and switching harmonics are present, together masking fault-related spectral features and making reliable detection more difficult [
9]. In real-world conditions, degradation does not usually occur in isolation. Different fault types may appear in various forms and with varying levels of severity, reducing the effectiveness of simple binary classification approaches and making more advanced multi-class diagnostic frameworks necessary [
10].
Deep learning has improved FDD in electric drive systems. Convolutional neural networks (CNNs) are especially useful because they can learn spatial features and short-term temporal patterns directly from raw sensor data, reducing the need for manual feature extraction and making the diagnostic process more straightforward [
2,
11]. For time-series fault signals, recurrent architectures such as long short-term memory (LSTM) networks and their bidirectional extension (BiLSTM) are well-suited for capturing sequential dependencies [
4,
12]. Recent studies have also incorporated attention-based mechanisms inspired by transformer architectures. These approaches allow the model to focus on the most informative time steps and feature channels when identifying faults [
13,
14]. Using a single model architecture can limit performance. Hybrid approaches, on the other hand, combine different modeling strengths and usually lead to more robust and accurate results within an end-to-end framework [
15,
16].
Despite their strong performance, many deep learning-based FDD models suffer from limited interpretability. In safety-critical settings, it is not enough to achieve high accuracy. The reasoning behind each fault classification also needs to be clearly understood. SHapley Additive exPlanations (SHAP), which is based on cooperative game theory, offers a model-agnostic framework for quantifying the contribution of each feature to individual predictions [
17]. Recent studies have applied SHAP to vibration-based machine learning models for motor diagnostics [
18] and to multimodal predictive maintenance systems [
19]. Its use in end-to-end deep learning architectures for PMSM inverter fault diagnosis, however, remains limited. Expanding its application in this context would help improve transparency, support operator decision-making, and increase trust in industrial systems.
A hybrid deep learning framework combining CNN, BiLSTM, and a multi-head self-attention mechanism is developed for multi-class fault diagnosis in inverter-driven PMSM systems. The approach is validated using a publicly available multi-sensor experimental dataset consisting of 10,892 samples collected under nine operational conditions [
20]. The architecture sequentially extracts local temporal features via CNN, models bidirectional long-range dependencies via BiLSTM, and applies adaptive time-step weighting via a multi-head self-attention sub-layer with residual connections. A block-aware chronological data splitting strategy is adopted to prevent temporal data leakage, and hyperparameters are selected through a 24-configuration validation sweep. SHAP GradientExplainer analysis is applied post hoc to the trained model, yielding physically interpretable feature importance rankings that are validated against the known mechanisms of each fault class. The principal contributions of this work are as follows: (i) a systematic ablation study quantifying the incremental contribution of each architectural component; (ii) a methodologically rigorous evaluation protocol that avoids temporal leakage in block-structured datasets; and (iii) SHAP-based interpretability analysis that links model decisions to measurable physical sensor signatures across nine fault classes.
The remainder of this paper is organised as follows. Section 2 reviews related work on deep learning-based PMSM fault diagnosis, attention mechanisms, and SHAP-based interpretability. The experimental dataset and its structural characteristics are presented in Section 3. Section 4 outlines the proposed methodology, including preprocessing procedures, data partitioning, model architecture, and the training strategy. Comparative performance results and ablation analysis are reported in Section 5. Section 6 provides the SHAP-based explainability analysis. Section 7 discusses the results in the context of related literature, and Section 8 concludes with directions for future research.
2. Related Work
Early deep learning approaches to PMSM fault diagnosis relied predominantly on one-dimensional CNN architectures applied to raw current or vibration signals. Song et al. [
6] proposed a multiscale kernel residual CNN for inter-turn short-circuit estimation that demonstrated effectiveness under complex operating conditions. Li et al. [
11] extended this direction by developing a mechanism-based fault diagnosis method using time-frequency image representations, achieving over 98.6% accuracy on ITSC and demagnetisation faults. The repeated finding across this body of work is that CNNs effectively extract discriminative local features from sensor time series, but their inherently fixed receptive fields limit the capture of long-range temporal dependencies that span multiple electrical cycles.
To address the temporal modelling limitations of CNNs, recurrent neural network variants have been incorporated into FDD pipelines. Yan and Hu [
4] demonstrated that a multiscale residual dilated CNN combined with a BiLSTM layer achieved 4.2% higher accuracy than a standalone CNN and 29.06% higher than a standalone BiLSTM for ITSC and demagnetisation fault diagnosis in ship PMSMs, directly motivating the hybrid design adopted in the present study. Yatak [
3] proposed a hybrid deep model for simultaneous inverter-driven and stator winding fault detection in PMSMs, achieving 99.44% and 99.98% accuracy on the two fault categories using multiple signal transforms. Peng et al. [
9] proposed a self-attention-enhanced convolutional architecture capable of diagnosing early PMSM faults under multiple unseen operating conditions by modelling long-range dependencies in two-phase current signals without relying on sequential recurrent structures. Lee et al. [
12] demonstrated that attention recurrent neural networks could reliably estimate ITSC fault severity across varying operating points, establishing the diagnostic value of attention-gated recurrent processing for severity-sensitive applications. Gmati et al. [
21] proposed a BiLSTM-based open-circuit fault diagnosis approach for induction motor drives and reported only marginal accuracy gains from bidirectionality over standard LSTM (98.07% vs. 97.69%), suggesting that the utility of bidirectional temporal modelling is task-dependent and may vary with signal characteristics and fault type.
The recognition that CNN and recurrent components capture complementary aspects of fault signals (local feature patterns versus long-range temporal dependencies) has motivated a growing class of hybrid architectures. Xu et al. [
13] proposed a CNN–LSTM–Attention model for PMSM fault diagnosis that achieved at least 97% accuracy with strong adaptability across common fault types, confirming the generalisability of the hybrid design principle. Yang et al. [
15] developed a hybrid CNN–BiLSTM–Multi-Head Self-Attention model for rotor motor bearing fault diagnosis that achieved 99.33% accuracy under variable speeds and demonstrated stability in real-world conditions. Fan et al. [
14] proposed a large-kernel group convolutional perceptron attention network for ITSC fault diagnosis in PMSMs, in which multi-head self-attention improved both feature representation and interpretability. Overall, these results suggest that combining CNN, BiLSTM, and attention mechanisms generally provides more consistent accuracy gains than using a single model across different PMSM fault scenarios.
Beyond hybrid CNN-recurrent designs, attention-based and transformer-inspired architectures have also been explored for motor fault diagnosis. Zheng et al. [
16] developed an interpretable harmonic-aware dual-branch neural network that achieved 99.90% accuracy and 99.91% F1-score under signal disturbances for open-circuit fault diagnosis in dual three-phase PMSMs, integrating SHAP to support interpretability. Sun et al. [
22] proposed a 1D-CNN–MLP–cross-attention architecture with a golden cosine scheduler, demonstrating the utility of cross-attention for fusing time-domain and frequency-domain feature representations. Their model achieved 99.83% baseline accuracy and demonstrated strong robustness by maintaining over 90% accuracy even under extreme 0 dB noise conditions. The present study adopts a multi-head self-attention sub-layer with residual connections following the BiLSTM encoder, as this configuration has been shown to provide more stable training than full transformer encoders for low-frequency industrial time series.
A parallel research direction emphasises the fusion of heterogeneous sensor modalities to improve diagnostic comprehensiveness and noise robustness. Fan and Hu [
23] reported 98.2% accuracy by fusing vibration, temperature, and electrical signals within an attention-based lightweight architecture suitable for real-time edge deployment. Cömert et al. [
10] demonstrated 100% and 98.95% accuracy for ITSC and inter-coil fault detection by combining current and vibration signals in a data fusion framework. Wang et al. [
5] showed that the synchronised fusion of current and vibration signals, tuned via Bayesian hyperparameter optimisation, improves robustness of severity estimation for early ITSC diagnosis. The present study evaluates a multi-sensor dataset comprising current, DC bus, temperature, and driver voltage measurements, leveraging the complementary fault-discriminative information across these modalities without requiring external sensor synchronisation.
Hyperparameter optimisation has received increasing attention as a means of improving the generalisation and efficiency of deep FDD models. Wang et al. [
5,
24] applied Bayesian optimisation for hyperparameter tuning in CNN-based ITSC diagnosis, demonstrating improved accuracy and reduced model complexity. Zhang et al. [
25] employed a multi-objective tree-structured Parzen estimator to optimise a residual CNN for ITSC fault diagnosis, achieving 99.62% accuracy with improved noise robustness. In the present study, a systematic grid sweep over 24 configurations is used to select CNN filter counts, BiLSTM unit sizes, dropout rates, and learning rates, providing a transparent and reproducible hyperparameter selection protocol. Exhaustive enumeration is computationally tractable for this four-dimensional discrete grid (2 × 2 × 3 × 2) and eliminates the stochastic coverage gaps inherent to random or Bayesian sampling on sparse grids; reproducibility is also maximised since every configuration is evaluated under identical conditions. The number of attention heads (num_heads = 4) was fixed a priori following standard transformer encoder practice and was not included in the sweep, as the search space was already fully enumerated without this dimension. XGBoost hyperparameters, by contrast, were selected via RandomizedSearchCV because that search space is substantially larger and partly continuous, making exhaustive enumeration infeasible; the two strategies are therefore complementary choices matched to their respective search space sizes.
The use of explainable artificial intelligence (XAI) in deep learning-based FDD systems has become more prominent, especially in safety-critical settings where transparency and regulatory requirements play an important role. Shojaeinasab et al. [
17] proposed a unified XAI framework for signal-based models, integrating SHAP-based feature selection with interpretable outputs, and reported improved model simplicity without loss of accuracy. In a related study, Wang and Wang [
18] applied SHAP to vibration-based machine learning models for motor fault diagnosis, showing that interpretability can enhance both model reliability and alignment with physical system behavior. Sharma et al. [
26] incorporated SHAP into an ensemble model combining CNN, LSTM, and random forest, achieving strong classification performance and enabling real-time analysis of feature contributions. In another study, Awan et al. [
27] developed an explainable framework for power electronics fault diagnosis using LIME, SHAP, and attention mechanisms, and evaluated its interpretability on both simulated and real-world datasets. Despite these advances, the coherent integration of SHAP with deep hybrid CNN–BiLSTM–Attention architectures applied to multi-class PMSM inverter fault datasets remains an underexplored area, constituting the primary interpretability contribution of the present study.
Taken together, the surveyed literature demonstrates that the individual components (CNN-based feature extraction, BiLSTM temporal modelling, multi-head attention, and SHAP interpretability) have each been validated independently for motor fault diagnosis. Their systematic combination within a single end-to-end framework, evaluated under a rigorous non-leaking temporal split protocol on a multi-fault PMSM inverter dataset, has not been previously reported. The majority of existing studies either employ random data splits that inflate performance estimates by distributing time-adjacent samples across training and test partitions, or they limit interpretability to global attention visualisations without per-feature Shapley attributions at the fault-class level. The present study addresses both gaps simultaneously, contributing a methodologically robust evaluation and physically validated explainability analysis for the nine-class PMSM inverter fault diagnosis task.
3. Dataset Description
The experimental evaluation uses the publicly available multi-sensor PMSM inverter fault dataset introduced by Bacha [
20]. The dataset was collected from a custom-built laboratory test bench comprising a three-phase two-level MOSFET inverter powered by a 15 V DC supply and driving a PMSM converted from a DENSO car alternator. Data acquisition was performed at a sampling frequency of 10 Hz using an Arduino-based system, with motor speed regulated at a constant 10 rad/s via Field-Oriented Control throughout all recording sessions. Speed variation primarily affects the fundamental electrical frequency and the amplitude of current harmonics, whereas load variation modulates steady-state current magnitude and thermal dissipation patterns. Holding the operating point constant ensures that signal variations across fault classes are attributable to fault conditions rather than changes in external loading.
The dataset contains 10,892 samples organised into nine operational classes, as presented in Table 1: one normal operating condition (F0) and eight fault scenarios covering high-side open-circuit faults (F1), low-side open-circuit faults (F2), low-side short-circuit faults (F3), high-side short-circuit faults (F4, F5), and overheating conditions affecting individual or multiple half-bridge modules (F6, F7, F8). Each sample comprises eight raw sensor measurements: two phase currents (Ia, Ib), DC bus voltage (VDC), DC bus current (IDC), three half-bridge temperatures (T1, T2, T3), and driver voltage (VD). These are accompanied by fifteen derived features, including physical unit conversions, DC power, AC power, current imbalance, maximum temperature difference, normalised currents, and moving averages and rates of change for key signals.
The distribution of the analysed classes is shown in
Figure 1, where F0 constitutes 39.4% of total samples while fault classes range from 3.1% (F4) to 15.9% (F7), indicating a noticeable imbalance typical of real-world data. The dataset has a temporal structure of nine consecutive blocks, each corresponding to a single experimental recording session conducted under a specific operating condition. This block structure affects the experimental design and is discussed further in
Section 4.2.
Figure 2 shows the Pearson correlation heatmap for the eight raw sensor channels. A clear negative correlation appears between Ia and Ib (r = −0.77), consistent with the expected phase relationship in a three-phase system. Moderate positive correlations appear among the temperature sensors T1, T2, and T3 (r = 0.29–0.52), reflecting thermal interactions between adjacent half-bridge modules. The voltage and DC bus channels (VDC, VD, IDC) show weak correlations with both the phase current and temperature signals, indicating that each sensor group captures distinct aspects of system behavior. This distinction supports the multi-sensor fusion approach adopted in the proposed framework.
Figure 3 illustrates the temporal evolution of current imbalance and maximum temperature difference across the complete dataset. In the OC and SC fault regions (F1–F5), current imbalance shows clear transient peaks rather than a stable pattern, whereas the overheating regions (F6–F8) display a gradual increase in temperature difference. These distinct responses suggest that each fault type produces its own characteristic signature in the sensor signals.
Figure 4 presents per-class box plots for T1 and Ia, further illustrating the separation between overheating faults, characterised by elevated T1 distributions, and OC/SC faults, distinguished by their Ia amplitude and spread patterns.
4. Methodology
4.1. Pre-Processing and Feature Selection
Among the 25 available features, seven were excluded from the model input based on redundancy and data quality considerations. The Timestamp column encodes data acquisition order and carries no physical sensor information. Ia_arduino and Ib_arduino are alternative calibration estimates of Ia_original and Ib_original with Pearson correlation coefficients exceeding 0.99, and their inclusion would introduce near-perfect multicollinearity without adding diagnostic information. IDC_arduino and IDC_original are both derived from the same DC current channel and are therefore redundant. Ia_Normalized and Ib_Normalized together contain 28 infinite values resulting from division by zero in low-current transients; since the raw current signals are normalised by StandardScaler within the pipeline, these features are not required. The final feature set consists of 18 variables: eight raw sensor measurements (Ia, Ib, VDC, IDC, T1, T2, T3, VD) and ten derived features (Ia_original, Ib_original, Power_DC, Power_AC, Current_Imbalance, Temp_Diff_Max, VDC_RateOfChange, IDC_RateOfChange, VDC_MovingAvg, IDC_MovingAvg).
Data quality issues identified prior to splitting were addressed as follows. Three features contained infinite values (Current_Imbalance: 21; Ia_Normalized: 14; Ib_Normalized: 14) arising from division operations on near-zero denominators. Of these, only Current_Imbalance belongs to the final feature set; Ia_Normalized and Ib_Normalized are excluded from the model input, and the other excluded features were not affected. Four features contained isolated NaN values (VDC_RateOfChange: 1; IDC_RateOfChange: 1; VDC_MovingAvg: 9; IDC_MovingAvg: 9) resulting from differencing and windowing operations at the start of each recording. Infinite values were replaced by NaN, and all NaN values were then imputed using the training-set median of the corresponding feature. The medians were computed on the training split only and then applied unchanged to the validation and test splits. This imputation order preserves temporal data integrity and prevents any information flow from the validation and test partitions into the training pipeline.
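The imputation order described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the pipeline used in the study: the helper names fit_train_medians and impute are hypothetical, and the toy rows are invented purely to show the mechanics.

```python
import math
from statistics import median

def fit_train_medians(train_rows, n_features):
    """Per-feature median computed from finite training values only."""
    medians = []
    for j in range(n_features):
        finite = [row[j] for row in train_rows if math.isfinite(row[j])]
        medians.append(median(finite))
    return medians

def impute(rows, medians):
    """Replace +inf, -inf and NaN with the stored training-set median."""
    return [
        [v if math.isfinite(v) else medians[j] for j, v in enumerate(row)]
        for row in rows
    ]

# Toy data (hypothetical): two features, with inf/NaN in both splits.
train = [[1.0, float("inf")], [3.0, 4.0], [float("nan"), 6.0]]
test = [[float("-inf"), 10.0]]

medians = fit_train_medians(train, n_features=2)  # fitted on training rows only
train_clean = impute(train, medians)
test_clean = impute(test, medians)                # reuses the training medians
```

Because the medians are fitted once on the training rows and only then applied elsewhere, no statistic of the validation or test data enters the fill values.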
4.2. Block-Aware Data Splitting
The dataset is organised in temporal blocks, with each fault class captured as one continuous recording segment. A similar structure has been reported in other inverter-driven synchronous motor fault datasets [
29]. A random stratified split would distribute time-adjacent samples across partitions, constituting data leakage and producing optimistically biased accuracy estimates.
To address this, a block-aware chronological splitting strategy is adopted. Within each class block, samples are partitioned sequentially into 65% training, 20% validation, and 15% test sub-blocks. Windows are subsequently constructed independently within each sub-block, ensuring that no sliding window crosses a training–validation–test boundary. This procedure yields 1393 training windows, 413 validation windows, and 306 test windows across all nine classes. The increased validation fraction of 20% (compared to the symmetric 15%/15% split used in preliminary experiments) was adopted to stabilise hyperparameter selection, as a 305-window validation set produced several degenerate configurations with perfect validation F1 = 1.0 due to insufficient sample diversity.
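A minimal sketch of the block-aware chronological split is given below. It assumes that each class occupies one contiguous run of identical labels and that sub-block sizes are rounded down with the remainder assigned to the test sub-block; the source does not state the rounding rule, and the function name block_aware_split is hypothetical.

```python
def block_aware_split(labels, train_pct=65, val_pct=20):
    """Split every contiguous class block into chronological sub-blocks.

    Returns, per partition, a list of (label, start, end) index ranges
    with `end` exclusive. Rounding down is an assumption of this sketch.
    """
    splits = {"train": [], "val": [], "test": []}
    start, n = 0, len(labels)
    while start < n:
        end = start
        while end < n and labels[end] == labels[start]:
            end += 1                      # extend to the end of this class block
        size = end - start
        n_train = size * train_pct // 100
        n_val = size * val_pct // 100
        a, b = start + n_train, start + n_train + n_val
        splits["train"].append((labels[start], start, a))
        splits["val"].append((labels[start], a, b))
        splits["test"].append((labels[start], b, end))
        start = end
    return splits

# Toy label sequence (hypothetical block lengths).
labels = ["F0"] * 100 + ["F1"] * 40
splits = block_aware_split(labels)
```

Windows would then be built separately inside each returned range, so that no window straddles a training, validation, or test boundary.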
The temporal structure visible in Figure 3 (transient current imbalance peaks in F1–F5 and a gradual temperature ramp-up in F6–F8) raises the question of whether boundary artefacts near block edges influence evaluation. Because the sequential split is applied within each class block, the early transient region of each class falls predominantly in the training sub-block. The test sub-block corresponds to the trailing 15% of each class recording, capturing quasi-steady-state fault signatures after startup transients have settled. Windows are constructed exclusively within each sub-block with no window crossing a block boundary, so neither the training–validation nor the validation–test boundary generates contaminated windows that mix transient and steady-state samples. The reported test performance therefore reflects the model’s ability to diagnose established fault conditions; fault onset detection represents a distinct diagnostic problem requiring a different experimental protocol.
4.3. Sliding Window Construction
Sequential sensor measurements are organised into fixed-length sliding windows to provide temporal context for the recurrent and attention components of the proposed architecture. A window size of w = 15 samples (1.5 s at 10 Hz) and a stride of s = 5 samples (0.5 s) were selected based on the window size sensitivity analysis presented in Section 5.3. At 10 Hz, the acquisition system resolves thermal dynamics and steady-state current asymmetry (the persistent, low-frequency signatures that each fault class imprints on the sensor channels) but cannot capture high-frequency switching transients such as PWM-induced voltage spikes at the inverter switching frequency, which typically falls in the kHz range. The 1.5 s window therefore captures the slowly evolving fault envelope rather than instantaneous switching behaviour, which is consistent with the nature of the features retained in the final feature set. Regarding cycle coverage, the motor operates at a constant speed of 10 rad/s; the fundamental mechanical frequency is therefore 10/(2π) ≈ 1.6 Hz, and the 1.5 s window spans roughly 2.4 mechanical cycles, which is sufficient to capture the repeating current asymmetry and thermal patterns that characterise each fault class under steady-state conditions. Each window is represented as a tensor of shape (15, 18) corresponding to 15 consecutive time steps and 18 input features. Windows are constructed per class within each split; no window crosses a class boundary, preserving the semantic integrity of fault episodes.
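The window construction and the cycle-coverage arithmetic can be checked with a short sketch. The function name sliding_windows and the 40-sample sub-block are illustrative assumptions; only w = 15, s = 5, the 10 Hz sampling rate, and the 10 rad/s speed come from the text.

```python
import math

def sliding_windows(rows, window=15, stride=5):
    """Return every full window of `window` consecutive rows, stepping by `stride`."""
    return [rows[i:i + window] for i in range(0, len(rows) - window + 1, stride)]

# Hypothetical sub-block: 40 time steps with 18 features each.
sub_block = [[float(t)] * 18 for t in range(40)]
windows = sliding_windows(sub_block)      # window starts at t = 0, 5, 10, 15, 20, 25

# Cycle coverage at the stated operating point.
sampling_hz = 10.0
speed_rad_s = 10.0
window_seconds = 15 / sampling_hz                     # 1.5 s
mech_freq_hz = speed_rad_s / (2 * math.pi)            # about 1.59 Hz
cycles_per_window = window_seconds * mech_freq_hz     # about 2.39 cycles
```

A sub-block shorter than one window yields no windows at all, which is the behaviour one would want when sub-blocks are processed independently.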
StandardScaler normalisation is applied by fitting the scaler on the training split feature matrix and applying the identical transformation to the validation and test splits. Class imbalance is handled through weighted loss during deep learning training (class weights computed using sklearn compute_class_weight with the balanced strategy) and through sample weights for the XGBoost baseline model.
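The two steps in this paragraph can be illustrated without external libraries. The sketch below mirrors what StandardScaler and the balanced class-weight heuristic compute (population standard deviation, with constant features left unscaled, and weights of n_samples / (n_classes × class count)); it is not the scikit-learn code itself, and the function names and toy values are hypothetical.

```python
from collections import Counter
from statistics import fmean, pstdev

def fit_scaler(train_rows):
    """Per-feature mean and population std, fitted on training rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]   # constant feature: scale by 1
    return means, stds

def transform(rows, means, stds):
    """Apply the training-set statistics to any split."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Toy values (hypothetical).
train_rows = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_scaler(train_rows)
test_scaled = transform([[5.0, 12.0]], means, stds)
weights = balanced_class_weights(["F0"] * 6 + ["F4"] * 2)
```

With this weighting, a class that is three times rarer receives three times the weight, so each class contributes equally to the total loss in expectation.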
4.4. Proposed Architecture: CNN–BiLSTM–Attention
The proposed model processes windowed multi-sensor sequences through four sequential functional blocks, as summarised in Table 2.
The CNN block consists of two one-dimensional convolutional layers with ReLU activation, batch normalisation, and dropout. The first layer applies fc filters of kernel size 3 to the input sequence, extracting local temporal patterns across the 18-dimensional feature space. The second layer doubles the filter count to 2fc, deepening the hierarchical representation. MaxPooling is deliberately omitted to preserve the temporal resolution required by the subsequent recurrent layers.
The BiLSTM block comprises two stacked bidirectional LSTM layers. The first processes the CNN output sequence in both forward and backward directions using lu × 2 recurrent units per direction, capturing long-range temporal dependencies from both past and future context within the window. The second BiLSTM layer uses lu recurrent units per direction, producing a sequence of hidden states of dimensionality lu × 2 that serves as input to the attention sub-layer.
The multi-head self-attention sub-layer computes scaled dot-product attention with four heads and key dimensionality 16 over the BiLSTM output sequence. A residual connection adds the attention output to the BiLSTM sequence, followed by layer normalisation. A position-wise feed-forward sub-layer with dimensionality lu × 2 and dropout is then applied with a second residual connection and layer normalisation, following the standard transformer encoder block design [
30].
The classifier head applies GlobalAveragePooling1D to aggregate the attended sequence into a fixed-size vector, followed by two dense layers with ReLU activation and dropout, and a final softmax output layer with nine units corresponding to the nine operational classes.
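To make the dimensionalities in the four blocks concrete, the following sketch traces the tensor shape of a single window through the architecture as described above. It assumes "same" padding in the convolutions (so that the 15 time steps are preserved) and concatenation of the two LSTM directions; neither detail is stated explicitly in the text. The widths of the two dense layers in the classifier head are not given, so they are omitted, and trace_shapes is a hypothetical helper.

```python
def trace_shapes(window=15, n_features=18, fc=64, lu=32, n_classes=9):
    """List (stage, output shape) pairs for one input window."""
    t = window  # time axis preserved: no pooling, 'same' padding assumed
    return [
        ("input", (t, n_features)),
        ("conv1d_1", (t, fc)),
        ("conv1d_2", (t, 2 * fc)),
        ("bilstm_1", (t, 2 * (2 * lu))),        # 2*lu units per direction, concatenated
        ("bilstm_2", (t, 2 * lu)),              # lu units per direction, concatenated
        ("attention_residual", (t, 2 * lu)),    # residual add requires matching width
        ("feed_forward_residual", (t, 2 * lu)),
        ("global_average_pooling", (2 * lu,)),
        ("softmax", (n_classes,)),
    ]

for stage, shape in trace_shapes():
    print(f"{stage:24s} {shape}")
```

For lu = 32, the width entering the attention sub-layer is 64, which equals the concatenated width of four heads with key dimensionality 16.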
4.5. Hyperparameter Selection
Hyperparameters for the proposed model were selected through a systematic grid sweep over 24 configurations covering: filters fc ∈ {64, 128}, LSTM units lu ∈ {32, 64}, dropout rate ∈ {0.1, 0.2, 0.3}, and learning rate lr ∈ {10⁻³, 5 × 10⁻⁴}. Each configuration was trained for up to 60 epochs with early stopping (patience = 8) and evaluated on the validation set using macro F1.
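The exhaustive enumeration of this grid is small enough to write out directly. In the sketch below, the grid values are taken from the text, whereas pick_best and its tie-breaking rule (prefer smaller fc, then smaller lu, when validation macro F1 is tied) are assumptions intended only to illustrate one way of choosing the most parsimonious configuration; training and evaluation are not shown.

```python
from itertools import product

grid = {
    "fc": [64, 128],
    "lu": [32, 64],
    "dropout": [0.1, 0.2, 0.3],
    "lr": [1e-3, 5e-4],
}

# Full Cartesian product: 2 x 2 x 3 x 2 = 24 configurations.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

def pick_best(results):
    """results: list of (config, validation macro F1).

    Highest F1 wins; ties go to the smaller model (assumed rule).
    """
    best_config, _ = min(results, key=lambda r: (-r[1], r[0]["fc"], r[0]["lu"]))
    return best_config

print(len(configs))
```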
Figure 5 presents the mean validation F1 and standard deviation for each hyperparameter value, averaged across all configurations sharing that value. The selected configuration (fc = 64, lu = 32, dropout = 0.2, lr = 10⁻³) achieved a validation F1 of 1.000 and was chosen as the most parsimonious of the top-ranked configurations. All ablation and competitor models were constructed using the same fc, lu, and dropout values to ensure fair component-wise comparison.
XGBoost hyperparameters were selected via RandomizedSearchCV with 30 iterations and five-fold stratified cross-validation on the training windows, optimising for macro F1. The search covered n_estimators ∈ [100, 500], max_depth ∈ [3, 9], learning_rate ∈ [0.01, 0.20], subsample ∈ [0.6, 1.0], colsample_bytree ∈ [0.6, 1.0], and min_child_weight ∈ [1, 10]. The best configuration achieved a cross-validation macro F1 of 0.9657.
4.6. Training Protocol
All deep learning models were compiled with the Adam optimiser and sparse categorical cross-entropy loss. Training proceeded for up to 100 epochs with early stopping (patience = 12, monitor = val_loss, restore_best_weights = True) and ReduceLROnPlateau (factor = 0.5, patience = 5, minimum learning rate = 10⁻⁶). A mini-batch size of 32 was used throughout. Class imbalance was addressed by passing per-class weights to the Keras class_weight argument during training, computed as described in Section 4.3.
To quantify stochastic training variability, each deep learning model was trained five times using independently initialised random seeds (42, 43, 44, 45, 46). At each run, all stochastic elements (weight initialisation, mini-batch sampling order, and dropout masks) are re-seeded independently. Performance metrics are recorded on the fixed held-out test set for each run, and the standard deviation across the five values quantifies sensitivity to training stochasticity. This protocol does not incorporate data resampling uncertainty such as bootstrap confidence intervals, as the block-aware split is fixed to preserve temporal integrity; the reported standard deviation therefore reflects initialisation and optimisation variance only. Classical ML models are deterministic given a fixed random seed; therefore, single-run results are reported for these baselines without standard deviation. All experiments were conducted on Google Colaboratory using an NVIDIA Tesla T4 GPU (16 GB VRAM). The five independent training runs for the proposed model were completed in approximately 6 min in total (approximately 1–2 min per run), with early stopping terminating training before the 100-epoch limit in all runs. The complete experimental pipeline, including exploratory analysis, hyperparameter sweep, all baseline and ablation model training, and SHAP attribution computation, was completed in under 50 min.
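The multi-seed summary can be expressed as a small helper. The sketch assumes the sample standard deviation (n − 1 denominator); the text does not state which estimator was used, and summarise_runs is a hypothetical name. Training itself is not shown.

```python
from statistics import mean, stdev

SEEDS = (42, 43, 44, 45, 46)

def summarise_runs(metric_by_seed):
    """Mean and sample standard deviation of one test metric across the seeds."""
    values = [metric_by_seed[seed] for seed in SEEDS]
    return mean(values), stdev(values)
```

The function would be called once per metric with a mapping from seed to the value measured on the fixed test set, for example the macro F1 of each of the five runs.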
4.7. Evaluation Metrics
Model performance is assessed using four complementary metrics that together provide a comprehensive view of classification quality under class imbalance. Macro-averaged F1-score (Macro F1) assigns equal weight to each class regardless of sample count and is therefore sensitive to minority-class performance. Matthews Correlation Coefficient (MCC) is a single scalar summary of the full confusion matrix that accounts for every cell, correct and incorrect, simultaneously and is considered one of the most informative metrics for multi-class imbalanced classification. Overall accuracy measures the fraction of correctly classified test windows. The macro-averaged Area Under the ROC Curve (AUC) measures how effectively the model separates classes across different decision thresholds. Precision, recall, and F1-score are also reported for each class individually.
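For reference, Macro F1 and the multi-class MCC can both be computed from the confusion matrix alone. The sketch below uses the standard multi-class generalisation of MCC; the function names and the six-sample toy labels are illustrative and are not results from the study.

```python
import math

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def macro_f1(cm):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    k = len(cm)
    scores = []
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(k)) - tp
        fn = sum(cm[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k

def multiclass_mcc(cm):
    """Multi-class MCC from total, correct, per-class true and predicted counts."""
    k = len(cm)
    s = sum(map(sum, cm))                                      # total samples
    c = sum(cm[i][i] for i in range(k))                        # correctly classified
    t = [sum(cm[i]) for i in range(k)]                         # true count per class
    p = [sum(cm[r][i] for r in range(k)) for i in range(k)]    # predicted count per class
    num = c * s - sum(ti * pi for ti, pi in zip(t, p))
    den = math.sqrt((s * s - sum(pi * pi for pi in p)) * (s * s - sum(ti * ti for ti in t)))
    return num / den if den else 0.0

# Toy example (hypothetical labels): one class-1 sample predicted as class 2.
cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2], n_classes=3)
f1 = macro_f1(cm)
mcc = multiclass_mcc(cm)
```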
6. Explainability Analysis
Model interpretability is provided through SHAP GradientExplainer applied to the complete trained CNN–BiLSTM–Attention model, from the raw input layer to the nine-unit softmax output. The explainer backpropagates gradients from each class-specific output neuron through the full forward pass (CNN feature extraction, BiLSTM temporal encoding, multi-head self-attention, and global average pooling) to the raw input tensor of shape (15, 18), using 100 randomly sampled training windows as background references and computing attributions for 200 test windows. The resulting SHAP arrays have shape (n_samples, T, F) per class; the time dimension is averaged (mean |SHAP| over T = 15 steps) to yield per-feature importance vectors of length F = 18 for each class, and global importance is obtained by averaging across all nine classes. This temporal averaging is appropriate because the diagnostic question of interest concerns which sensor channels are most discriminative for each fault class, rather than which specific sub-second intervals within the 1.5 s window are most informative. Attributions therefore reflect end-to-end contributions through all architectural components rather than being local to any single intermediate layer. GradientExplainer provides gradient-based approximations of Shapley values rather than exact solutions; for architectures containing recurrent components such as BiLSTM, the approximation quality is bounded by the smoothness of the gradient landscape, and the resulting attributions should therefore be interpreted as directional indicators of feature importance rather than precise Shapley values.
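The aggregation from raw attributions to importance rankings can be sketched as follows. The sketch assumes the attributions are already available as nested lists indexed [class][sample][time step][feature] and that the mean of |SHAP| is taken jointly over samples and time steps; the explainer call itself is not shown, and both function names are hypothetical.

```python
def per_class_importance(shap_per_class):
    """importance[c][f] = mean over samples and time steps of |SHAP| for class c."""
    importance = []
    for class_values in shap_per_class:
        n_features = len(class_values[0][0])
        totals = [0.0] * n_features
        count = 0
        for sample in class_values:
            for step in sample:
                count += 1
                for f, value in enumerate(step):
                    totals[f] += abs(value)
        importance.append([total / count for total in totals])
    return importance

def global_importance(importance):
    """Average the per-class importance vectors across classes."""
    n_classes = len(importance)
    return [sum(column) / n_classes for column in zip(*importance)]

# Toy attributions (hypothetical): 2 classes, 1 sample, 2 time steps, 2 features.
toy = [
    [[[1.0, -2.0], [-3.0, 0.0]]],
    [[[0.0, 4.0], [0.0, -2.0]]],
]
per_class = per_class_importance(toy)
overall = global_importance(per_class)
```

Taking absolute values before averaging prevents positive and negative attributions at different time steps from cancelling, which is why the result measures magnitude of influence rather than its direction.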
Figure 7 presents the global feature importance ranking. Temperature-related features dominate: T1 (half-bridge 1 temperature) and Temp_Diff_Max (maximum temperature differential) rank first and second, with mean |SHAP| values of 0.0083 and 0.0077 respectively, followed by T2 (0.0050) and T3 (0.0027). Current-related features occupy intermediate positions: Ib and Power_AC rank fifth and sixth, while Current_Imbalance, Ib_original, and Ia_original follow closely. Voltage and DC bus features (VDC, IDC, VD) consistently rank in the lower tier.
The per-class SHAP heatmap is shown in Figure 8, highlighting different sensor activation patterns for each fault type. For overheating faults (F6, F7, F8), T1, T2, and Temp_Diff_Max show the largest SHAP values, consistent with the direct thermal response of these sensors during half-bridge overheating. For open-circuit faults (F1, F2) and short-circuit faults (F3, F4, F5), current-related features, particularly Ib, Ia_original, and Current_Imbalance, show elevated importance. For normal operation (F0), feature importance remains low across both temperature and current signals, suggesting that the model relies on the absence of strong distinguishing patterns. These results are consistent with the expected physical behavior of the system and reinforce confidence in the model’s diagnostic decisions [16,17,18,28].
7. Discussion
The experimental results highlight several important points. The CNN–BiLSTM–Attention architecture delivers the strongest performance on the Bacha [20] dataset, with higher accuracy and MCC values than both baseline and ablation models. The ablation analysis indicates that each component contributes to the final outcome: CNN layers extract local patterns, BiLSTM layers capture temporal relationships, and the attention mechanism adjusts the importance of different time steps. The choice of a block-aware chronological split is also important for this dataset. Each fault class is recorded as a single continuous sequence, so a random split would place time-adjacent samples in both training and test sets, inflating performance estimates. Preserving the temporal structure leads to a more realistic evaluation of model generalisation. SHAP analysis further clarifies how the model responds to different fault types. For overheating faults (F6, F7, F8), temperature sensors account for a large share of the importance (33.2%), with T1 and Temp_Diff_Max standing out in the global ranking, consistent with their direct response to thermal effects during half-bridge heating. For OC and SC faults, current imbalance and phase current features become more prominent, aligning with the known role of stator current asymmetry as a diagnostic indicator for switch-level faults in three-phase inverters [31,32].
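The leakage argument for the block-aware chronological split can be made concrete with a short sketch. The version below is one possible reading of such a split, not the study's exact procedure: each class recording is cut in time order into train, validation, and test blocks, and sliding windows are then drawn only from within a single block. The 70/15/15 proportions, the stride of 1, the recording lengths, and the function names are assumptions for illustration; only the window length of 15 steps comes from the text.

```python
def chronological_block_split(n_samples, train_pct=70, val_pct=15):
    """Split ONE class recording into contiguous train/val/test index ranges.

    Integer percentages are used to avoid floating-point rounding at the
    block boundaries. The 70/15/15 default is an assumed ratio.
    """
    train_end = n_samples * train_pct // 100
    val_end = n_samples * (train_pct + val_pct) // 100
    return range(0, train_end), range(train_end, val_end), range(val_end, n_samples)


def make_windows(block, window=15, stride=1):
    """Sliding windows that lie entirely inside one block.

    Because every window starts and ends within the same block, no window
    straddles a split boundary and no sample index is shared across splits.
    """
    last_start = block.stop - window
    return [range(s, s + window) for s in range(block.start, last_start + 1, stride)]


# Hypothetical recording lengths (in samples) for two fault classes.
recording_lengths = {"F0": 200, "F6": 120}

splits = {}
for label, n in recording_lengths.items():
    train, val, test = chronological_block_split(n)
    splits[label] = {
        "train": make_windows(train),
        "val": make_windows(val),
        "test": make_windows(test),
    }

for label, parts in splits.items():
    print(label, {name: len(windows) for name, windows in parts.items()})
```

Windowing after the split, rather than before it, is the design choice that matters here: overlapping windows created first and shuffled afterwards would share raw samples between training and test sets.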
Several limitations of the present study should be acknowledged. The dataset was collected under fixed operating conditions (constant rotor speed of 10 rad/s, DC bus voltage of 15 V, and ambient temperature of 25 °C), representing a deliberate laboratory simplification. In field deployments, motor speed varies continuously, load torque fluctuates, and ambient temperature spans a wide range; each of these variations modulates the baseline current, voltage, and thermal signatures from which fault features are extracted, potentially shifting class boundaries and degrading model performance. Validating the proposed framework under variable-speed, variable-load, and wide-temperature protocols remains a priority direction for future work and will likely require either multi-condition experimental datasets or physics-informed domain adaptation strategies. The 10 Hz sampling rate (Nyquist limit of 5 Hz) is sufficient to capture thermal and steady-state current dynamics but cannot resolve the high-frequency switching transients that can carry early fault signatures and are observable only at higher acquisition rates. As noted in Section 6, GradientExplainer provides gradient-based approximations rather than exact Shapley values, so the importance values should be interpreted as directional indicators rather than precise attributions. Furthermore, while SHAP GradientExplainer identifies which sensor channels are most discriminative at the feature level, it does not directly reveal which time steps within the 1.5 s window the attention mechanism focuses on. Attention weight visualisation across time steps would complement the present feature-level analysis and is left for future work.
8. Conclusions
This study proposed a hybrid CNN–BiLSTM–Attention deep learning framework for multi-class fault diagnosis in inverter-driven PMSM systems and evaluated it on a publicly available multi-sensor experimental dataset spanning nine operational conditions. The proposed model achieves the highest accuracy (0.9810 ± 0.0102) and MCC (0.9757 ± 0.0130) among all evaluated alternatives, including classical ML approaches, sequence-based models, and a CNN–Transformer architecture. However, the Random Forest baseline attains a higher macro F1 score (0.9747 vs. 0.9681), reflecting the partial separability of fault classes through aggregated cross-sensor features without temporal modelling; this finding underscores the importance of reporting complementary metrics when evaluating diagnostic models under class imbalance. The ablation analysis indicates that each component contributes to the overall performance. In addition, the SHAP GradientExplainer analysis yields feature importance rankings that are consistent with known fault mechanisms and admit a physical interpretation.
The adoption of a block-aware chronological data splitting strategy, five-run statistical reporting, and validation-set-based hyperparameter selection represents a methodologically rigorous evaluation framework that avoids common sources of inflated performance estimates in temporal FDD benchmarks. Future work will investigate the generalisation of the proposed architecture to multi-speed and multi-load operating conditions, the incorporation of physics-informed constraints as regularisation terms to enforce thermodynamic and electromagnetic consistency, and the deployment of lightweight model variants suitable for embedded edge inference in real-time predictive maintenance systems.