5.1. Interpretability and Comparison with Related Work
Interpretability is a central requirement for data-driven battery prognostic models, particularly when such models are expected to support operational decisions in safety-critical or resource-constrained applications. Previous studies have primarily addressed interpretability through attention visualization techniques, where attention weights are inspected to identify influential cycles or temporal segments. For instance, attention heatmaps have been employed to illustrate cycle-level relevance in RUL estimation tasks [11].
In contrast, the present study introduces the M-score trend as an additional, complementary interpretability mechanism. Rather than replacing attention-based explanations, the M-score provides a degradation-consistency perspective that enables cross-validation of prediction reliability. By jointly examining the temporal evolution of M-scores and prediction errors, it is observed that smooth and gradually evolving M-score trends are consistently associated with stable and accurate RUL predictions. Conversely, abrupt fluctuations in the M-score often coincide with increased prediction uncertainty. This behavior positions the M-score as a robustness-oriented diagnostic indicator that augments attention-based explanations with an interpretable signal reflecting degradation regularity.
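The degradation-consistency reading of the M-score can be illustrated with a simple trend diagnostic. The sketch below is an illustration only, not the implementation used in this study: `rolling_instability`, the window size, and the flagging threshold are all hypothetical, and the M-score series itself is taken as given.

```python
# Hypothetical sketch: flag abrupt M-score fluctuations as low-confidence regions.
# The M-score values are assumed to be computed upstream; only the trend check is shown.

def rolling_instability(m_scores, window=5):
    """Rolling variance of first differences; large values mark abrupt M-score changes."""
    diffs = [b - a for a, b in zip(m_scores, m_scores[1:])]
    out = []
    for i in range(len(diffs) - window + 1):
        w = diffs[i:i + window]
        mu = sum(w) / window
        out.append(sum((x - mu) ** 2 for x in w) / window)
    return out

def flag_unstable_cycles(m_scores, window=5, threshold=0.01):
    """Return window indices where the M-score trend is abrupt enough to warrant
    treating the corresponding RUL predictions with extra caution."""
    inst = rolling_instability(m_scores, window)
    return [i for i, v in enumerate(inst) if v > threshold]
```

A smooth, gradually evolving M-score series produces no flags, while a series with a sudden jump is flagged, mirroring the qualitative association between M-score regularity and prediction stability described above.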
Table 7 summarizes representative studies that have employed different feature extraction and selection strategies on TRI-based datasets.
Reference [46] introduces an MI-based Bi-LSTM framework that initially extracts 61 handcrafted health indicators from voltage, current, temperature, incremental capacity (IC) curves, and energy-related metrics. These features are then ranked by mutual information, retaining the top seven. While the Bi-LSTM model shows stable training behavior, the experimental evaluation is limited to only four battery cells from the TRI/MIT dataset, which restricts the statistical robustness and generalizability of the results to broader degradation patterns. The model’s performance is also highly sensitive to preprocessing and feature construction choices, such as the selection of voltage windows and IC curve denoising methods. Although inference latency is low, the offline feature extraction and ranking steps increase data preparation overhead and reduce deployment flexibility. Consequently, the approach favors detailed feature engineering over data efficiency and scalability, limiting its practical utility despite promising results on a small, controlled dataset.
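The mutual-information ranking step described for Reference [46] can be sketched as follows. The histogram-based MI estimator, the bin count, and the function names are illustrative assumptions; the cited work's estimator and feature set may differ.

```python
import math
from collections import Counter

def mutual_information(x, y, bins=5):
    """Histogram-based MI estimate between two equal-length numeric sequences.
    A simple stand-in for whatever estimator the cited work actually uses."""
    def digitize(v):
        lo, hi = min(v), max(v)
        span = (hi - lo) or 1.0
        return [min(int((s - lo) / span * bins), bins - 1) for s in v]
    xb, yb = digitize(x), digitize(y)
    n = len(x)
    pxy, px, py = Counter(zip(xb, yb)), Counter(xb), Counter(yb)
    mi = 0.0
    for (i, j), c in pxy.items():
        p = c / n
        mi += p * math.log(p * n * n / (px[i] * py[j]))
    return mi

def top_k_features(features, target, k=7):
    """Rank feature columns {name: values} by MI with the target; keep the top k names."""
    ranked = sorted(features.items(),
                    key=lambda kv: mutual_information(kv[1], target),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

A feature that tracks the target closely scores high MI, while an uninformative constant feature scores zero, which is the property the ranking step relies on.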
Reference [47] presents the JFO-CFNN framework, which reduces 46 handcrafted health indicators to fifteen features using systematic sampling combined with Jellyfish Optimization. The model is evaluated on a limited set of battery cells (c33–c36), which constrains statistical reliability and limits understanding of how the model performs under varied operating conditions. From an engineering standpoint, the close integration of feature selection with a metaheuristic optimization algorithm increases sensitivity to hyperparameter settings, as feature relevance depends heavily on the optimizer’s control parameters. In addition, the iterative structure of Jellyfish Optimization raises computational demands during training due to repeated fitness evaluations. While the model shows strong accuracy on a small, curated dataset, it emphasizes optimization-driven performance at the expense of transparency, scalability, and deployment efficiency.
Reference [48] proposes a Parallel Feature Fusion Network (PFFN) that combines statistical cycle-level features with domain-informed indicators using parallel transformer encoders and a feature fusion mechanism. As shown in Table 7, the model is trained on 41 batteries, significantly more than studies relying on just a few selected cells, enabling better generalization across varied degradation behaviors. This advantage, however, comes with added architectural and optimization complexity. The framework incorporates multi-head attention, parallel transformer blocks, and Bayesian optimization for hyperparameter tuning, which increases sensitivity to tuning choices and extends the training time. While inference is efficient once the model is trained, the feature selection process is largely implicit, limiting interpretability and making it difficult to trace underlying degradation mechanisms. As a result, although PFFN performs well in data-rich environments, its deployment may be less practical in scenarios where fast adaptation, minimal tuning, or model transparency is essential.
Reference [49] introduces the Positive and Negative Convolution Cross-Connect Neural Network (PNCCN), designed to model battery degradation dynamics directly from voltage, current, and temperature time-series data using specialized PNC and NCC layers. The model is trained and evaluated on 118 lithium-ion cells from the TRI/MIT dataset, with a 60/20/20 split (approximately 71 training, 24 validation, and 23 test cells), offering far broader data coverage than studies limited to a few selected cells. This larger dataset enables exposure to more diverse degradation behaviors but also introduces considerable architectural and training complexity. The framework requires quadratic interpolation to align long time-series data (around 35,000 s per cycle) and training schedules extending up to 2000 epochs, contributing to significant computational cost. Model performance is also highly sensitive to hyperparameters, including convolutional filter configurations, nonlinear interaction settings, and training dynamics. Importantly, the substantial gap between training/validation and test RMSE (9.47 vs. 93.58 cycles) suggests a risk of overfitting, even with the large dataset. As a result, while PNCCN offers strong representational power without relying on internal resistance measurements, its deployment entails a trade-off between exploiting rich data, managing computational demands, and ensuring model robustness.
Reference [50] presents MuRAIN, a multi-time-resolution attention-based interaction network developed for joint estimation of RUL directly from raw cycling data. The model is evaluated on 124 cells from TRI-1 and 45 cells from TRI-2, using a one-third split at the cell level for training, validation, and testing. This yields approximately 41 training cells for TRI-1 and 15 for TRI-2, offering a large and diverse dataset that supports learning across varied degradation patterns. However, this advantage comes with significant architectural and computational complexity. MuRAIN integrates multi-resolution signal patching, stacked multi-head self-attention, and interactive learning modules, components that increase hyperparameter sensitivity, particularly in attention configuration and patch structuring. Additionally, processing raw, high-resolution cycling data through multiple attention layers results in notable training-time delays and memory demands, even though inference is efficient post-training. As a result, while MuRAIN demonstrates robust performance in data-rich settings, its practical use is better suited to high-capacity computing environments than to lightweight or rapidly deployable battery health-monitoring systems.
Transformer-based models have also shown promising performance on large-scale battery datasets; however, their effective deployment typically requires extensive hyperparameter optimization and dataset-specific input representations. In this study, the transformer was intentionally implemented as a parameter-matched but non-optimized baseline, and its performance should therefore be interpreted as indicative rather than exhaustive. By contrast, the proposed Bi-LSTM architecture with dual attention explicitly integrates domain-informed feature selection and attention-based temporal modeling within a unified and interpretable framework. Using 36 MCAS-selected features from an initial pool of 161 indicators, the proposed approach achieves a test RMSE of 43.85 cycles on 40 TRI batteries. These results show that combining adaptive multi-criteria feature selection with interpretable attention mechanisms improves both predictive accuracy and transparency by emphasizing degradation-relevant temporal segments and feature interactions.
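The multi-criteria selection idea behind MCAS can be sketched as a weighted scoring rule over per-feature criteria. The criteria shown (relevance, redundancy, stability) and the weights are illustrative assumptions, not the exact MCAS formulation or the tuned coefficients used in this study.

```python
def mcas_score(relevance, redundancy, stability, w=(0.5, 0.3, 0.2)):
    """Weighted multi-criteria score: reward target relevance and temporal stability,
    penalize redundancy with already-selected features. Weights are hypothetical."""
    w_rel, w_red, w_sta = w
    return w_rel * relevance - w_red * redundancy + w_sta * stability

def select_features(candidates, k):
    """candidates: {name: (relevance, redundancy, stability)} -> top-k names by score."""
    ranked = sorted(candidates, key=lambda n: mcas_score(*candidates[n]), reverse=True)
    return ranked[:k]
```

Under this sketch, a relevant, stable, non-redundant indicator outranks a redundant, noisy one, which is the qualitative behavior the 161-to-36 reduction described above relies on.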
Table 8 presents a comparative evaluation of the proposed MCAS-guided Bi-LSTM framework with dual attention against several recent state-of-the-art methods evaluated on the SNL multi-chemistry dataset.
Reference [51] investigates battery health and lifetime estimation using the SNL dataset, applying a cell-level training and testing strategy that captures degradation variability across a range of operating conditions. This broader data configuration enhances model robustness but also introduces sensitivity to dataset-specific factors, such as protocol diversity and measurement resolution, which affect hyperparameter tuning. The method depends on high-resolution cycling data, which increases computational demands during training, though inference remains efficient once the model is trained. The DegradAI Mixture-of-Experts model shows strong chemistry-adaptive performance, reporting a low mean absolute error (≈2.5 × 10⁻² Ah) and R² values near 0.99 across LFP, NMC, and NCA chemistries. However, since the specific training and testing cell groups and fixed evaluation splits are not clearly defined, RMSE comparisons should be interpreted cautiously. The reported metrics are best understood as dataset-specific indicators rather than directly comparable performance benchmarks.
Reference [52] explores early-cycle lifetime prediction on the SNL dataset, using temperature-dependent health indicators extracted from only the first 10 charge–discharge cycles and a lightweight ElasticNet regression model. This approach significantly reduces training data requirements and supports fast deployment with minimal computational overhead. From an engineering standpoint, the model offers high interpretability and negligible inference latency due to its linear structure. However, reported performance varies widely, with RMSE ranging from 37 to 329 cycles and MAPE between 6% and 17%, reflecting sensitivity to data partitioning and variability in operating conditions. These results highlight a trade-off between model simplicity and predictive accuracy. Given the lack of fixed training groups and consistent evaluation splits, the reported RMSE values should be viewed as context-specific rather than as directly comparable performance benchmarks.
Reference [53] reports strong capacity estimation performance on NCA and NCM cells across varying operating conditions, with RMSE values ranging from approximately 0.009 to 0.028 Ah and R² consistently between 0.98 and 0.99. The feature importance analysis indicates that predictions are primarily influenced by charge/discharge energy metrics and temperature statistics, highlighting the model’s sensitivity to thermal and energetic degradation patterns. From an engineering standpoint, the results reflect effective modeling of chemistry-specific behavior. However, the evaluation is limited to a predefined set of cells and focuses solely on capacity estimation (Ah), rather than on cycle-based lifetime or RUL prediction. As such, Reference [53] is excluded from Table 8, which includes only methods assessed under comparable RUL-focused targets and clearly defined training/testing protocols. While the reported accuracy is high within the study’s specific scope, these results should be interpreted within the context of the SNL operating conditions and target formulation used, and are not directly comparable to RUL-oriented studies.
Despite their respective merits, these approaches primarily rely on static regression formulations or chemistry-specific feature mappings and are therefore limited in their capacity to capture long-range temporal dependencies and complex multi-domain interactions inherent in battery degradation sequences. In contrast, the proposed framework explicitly integrates MCAS-based feature refinement with a dual-attention mechanism, enabling the simultaneous modeling of temporal dynamics and feature-level relevance within a unified recurrent architecture.
Across all SNL chemistries, the proposed model achieves RMSE = 280.5 cycles, MAE = 198.7 cycles, and R² = 0.9623, indicating strong generalization and robustness, particularly under the highly nonlinear degradation behavior observed in NCA cells. While absolute error magnitudes may vary across studies due to differences in target definitions, evaluation protocols, and units of measurement, the proposed framework consistently demonstrates a favorable balance between predictive accuracy and interpretability. Collectively, the results indicate that combining MCAS-guided feature selection with dual-attention-based temporal modeling supports robust and scalable RUL estimation in heterogeneous battery systems.
5.3. Limitations of the M-Score and MCAS Design Choices
Despite its advantages, several limitations of the M-score should be explicitly acknowledged. First, the indicator does not capture mechanistic distinctions among degradation processes such as lithium plating, SEI evolution, or electrode structural failure. Distinct failure processes may produce similar macroscopic signatures in voltage, temperature, or capacity measurements, resulting in comparable M-score responses despite fundamentally different degradation origins.
Second, certain operating conditions may lead to elevated M-score values without corresponding irreversible degradation. Highly dynamic load profiles, abrupt ambient temperature changes, or transient measurement noise can introduce short-term irregularities that increase entropy-related components of the M-score. When considered in isolation, such effects may produce false-positive indications of degradation inconsistency. This limitation motivates the integration of the M-score within the broader MCAS-guided feature selection and attention-based modeling framework, where noisy, redundant, or temporally unstable features are explicitly penalized.
Conversely, under slow and highly uniform degradation regimes, the M-score may exhibit reduced sensitivity, as conventional health indicators based on capacity fade or resistance growth already provide sufficient prognostic information. In such cases, the incremental benefit of degradation-consistency metrics becomes less pronounced.
Regarding the MCAS mechanism, dataset-specific re-optimization of weighting coefficients could potentially improve numerical performance. However, this strategy was deliberately avoided to prevent overfitting by design and to preserve fair cross-dataset comparison. Instead, MCAS weights were fixed following preliminary tuning and consistently applied across all datasets and chemistries, prioritizing robustness and generalizability over dataset-specific optimality.
5.4. Accuracy–Complexity Trade-Off Considerations
While the proposed MCAS-guided Bi-LSTM framework with dual attention delivers strong predictive performance, its practical use must be considered alongside model complexity. The final architecture includes approximately 1.9 million trainable parameters—striking a deliberate balance between expressive capacity and generalization, particularly for modeling the nonlinear dynamics of battery degradation. To assess this design trade-off, lighter variants with reduced Bi-LSTM depth and smaller hidden dimensions were also evaluated. These models significantly lower the parameter count and computational cost but consistently show reduced accuracy and greater sensitivity to temporal noise and feature redundancy, especially in cases involving delayed or nonlinear aging patterns.
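The stated parameter budget can be sanity-checked with a standard Bi-LSTM parameter count. The layer widths below are hypothetical (the text reports only the 36-feature input and the roughly 1.9 million total); under these assumed sizes the recurrent layers alone account for about 1.4 million parameters, with the attention and output heads presumably contributing the remainder.

```python
def bilstm_params(input_dim, hidden):
    """Parameters of one bidirectional LSTM layer:
    4 gates x (input weights + recurrent weights + bias), times 2 directions."""
    per_direction = 4 * (hidden * input_dim + hidden * hidden + hidden)
    return 2 * per_direction

# Hypothetical stack: 36 MCAS-selected features in; each layer's input is the
# previous layer's bidirectional output (2 x hidden), with widths decreasing.
layers = [(36, 256), (512, 128), (256, 64)]  # (input_dim, hidden) per Bi-LSTM layer
total = sum(bilstm_params(i, h) for i, h in layers)
```

This kind of count makes the accuracy–complexity comparison with lighter variants concrete: halving the hidden sizes roughly quarters the recurrent parameter count.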
Crucially, the RMSE improvements observed in the proposed model are not simply a result of increased size. Instead, they stem from the effective interaction between MCAS-based feature refinement and the dual attention mechanism, which jointly enhances temporal and feature-level representation. This combination allows the model’s added capacity to be used meaningfully rather than redundantly.
From a deployment perspective, the proposed architecture is suitable for offline training and cloud-based prognostic applications, in which accuracy and interpretability take precedence over strict computational constraints. Meanwhile, the lighter variants offer practical alternatives for resource-limited or real-time BMS applications. As shown in Figure 13, performance gains tend to plateau beyond a certain model complexity unless guided by informed feature selection, highlighting the key role of MCAS in optimizing the trade-off between accuracy and complexity.
From a theoretical perspective, the computational complexity of the proposed architecture is primarily driven by the stacked Bi-LSTM layers and the dual attention mechanism. For a sequence of length T, hidden dimension H, and attention key dimension d_k, each Bi-LSTM layer has a time complexity of O(T·H²), due to the recurrent operations in both forward and backward directions. The dual multi-head attention module adds further complexity of O(T²·d_k). In this design, these costs are carefully managed through several measures: the sequence length is fixed at a moderate T = 100 cycles, the number of hidden units is progressively reduced across Bi-LSTM layers, and the attention modules are configured with smaller key dimensions to avoid excessive overhead. As a result, the quadratic cost of attention does not dominate the overall computational load.
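A rough operation-count comparison illustrates why attention does not dominate at this sequence length. Only the 100-cycle sequence length comes from the text; the hidden size and key dimension below are assumed values, and the constant factors are order-of-magnitude approximations rather than exact FLOP counts.

```python
def bilstm_ops(T, H):
    """Approximate per-layer cost of a Bi-LSTM: O(T * H^2), with a factor of 8
    for the four gates' input and recurrent matmuls, doubled for both directions."""
    return 2 * 8 * T * H * H

def attention_ops(T, d_k):
    """Approximate cost of self-attention scoring: O(T^2 * d_k),
    counted twice for the QK^T and attention-times-V products."""
    return 2 * T * T * d_k

# Illustrative sizes: only T = 100 is from the text; H and d_k are assumptions.
T, H, d_k = 100, 128, 32
assert attention_ops(T, d_k) < bilstm_ops(T, H)
```

With these assumed sizes the recurrent term is tens of millions of operations per layer while the quadratic attention term is under a million, consistent with the claim that the attention cost remains secondary at T = 100.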
The MCAS feature selection stage is conducted offline before training and does not impact inference-time complexity. In practice, training is performed offline, while inference requires only a single forward pass, with sub-second latency on CPUs and millisecond-level latency on GPUs. Compared to transformer-only models, whose complexity scales quadratically with sequence length, the proposed hybrid recurrent–attention architecture offers a more balanced trade-off between model expressiveness and computational efficiency. This makes it well-suited for practical battery prognostics applications, where any accuracy improvement must be weighed against computational cost.
Table 9 summarizes the computational performance of the proposed model on representative CPU and GPU platforms. While CPU-based execution enables feasible training and inference, GPU acceleration provides a substantial reduction in training time, achieving approximately an 8–10× speedup depending on dataset size. Inference latency is still significantly lower than training cost in both configurations, with sub-second execution on CPU and millisecond-level latency on GPU. These results show that the proposed model can be efficiently deployed in practical scenarios, particularly in cloud-assisted or edge-computing-based BMSs.