1. Introduction
In light of the accelerating depletion of fossil energy sources and mounting environmental concerns, there is a pressing demand for sustainable alternatives. Among clean energy technologies, proton exchange membrane fuel cells (PEMFCs) have attracted widespread interest due to their high efficiency and environmental compatibility. These systems produce electricity through electrochemical reactions between hydrogen and oxygen [1]. Thanks to low operating temperatures, rapid startup, high energy conversion efficiency, and zero harmful emissions, PEMFCs are well suited to sectors such as portable power, electric vehicles, and aerospace [2]. Despite these strengths, commercialization is hindered by factors such as the reliance on costly platinum-based catalysts and their vulnerability to contaminants such as CO and sulfur compounds, which degrade catalytic performance and reduce stack durability [3]. To mitigate the impact of this degradation, an accurate remaining useful life (RUL) prediction approach for PEMFCs is essential: it enables proactive maintenance planning, extends service life, and helps reduce maintenance expenditures [4].
A typical PEMFC structure is composed of several key components, including bipolar plates, proton-conducting membranes, catalytic layers, and gas diffusion structures [
5]. The proton exchange membrane, usually made of perfluorosulfonic acid polymer (e.g., Nafion), serves as a solid polymer electrolyte, facilitating proton transfer while isolating fuel and oxidants. Platinum catalysts are widely adopted to accelerate the redox processes at the anode and cathode sites, promoting efficient electrochemical conversion within the fuel cell. Bipolar plates, critical structural elements, handle gas distribution, electrical conduction, and water and thermal management. The gas diffusion layer consists of a substrate and a microporous layer, which together promote uniform distribution of gases and facilitate efficient electron transport [
6].
RUL estimation techniques for PEMFCs are typically categorized into three major classes: physics-based models, data-driven (machine learning) approaches, and hybrid frameworks combining both [7]. In physics-based models, RUL is estimated by considering variables such as operating loads, material characteristics, degradation dynamics, and underlying failure phenomena. These models encompass mechanistic, empirical, semi-mechanistic, semi-empirical, and probabilistic approaches, with Kalman filters (KF) and particle filters (PF) serving as typical examples [8]. For instance, a model-driven approach was proposed as an aging-tolerant control strategy, mitigating the effects of aging on performance by dynamically adjusting input parameters [9]. This strategy combined health state assessment with model inversion, enabling the development of RUL prediction algorithms that were validated through simulations and proved effective in power prediction and lifespan estimation. Similarly, Liu et al. [10] developed a prediction model that integrates a semi-empirical degradation model with an adaptive unscented Kalman filter (AUKF). This approach assessed fuel cell health status and remaining useful life, improved parameter tuning, and outperformed the traditional unscented Kalman filter (UKF) in simulation tests.
Data-driven methods learn degradation patterns from historical data, from which RUL can be estimated [11,12]. Commonly employed techniques include artificial neural networks (ANN), relevance vector machines (RVM), adaptive neuro-fuzzy inference systems (ANFIS), and a range of other intelligent learning algorithms [13]. For example, gated recurrent unit (GRU) networks have been used for RUL forecasting in hydrogen fuel cell applications [14]. After preprocessing the data, extracting features, and comparing several neural network models, that study found that the GRU achieved higher accuracy, faster convergence, and significantly lower prediction errors. Another method estimates RUL by integrating sparse autoencoders (SAE) with deep neural networks (DNN), providing a reliable approach for analyzing complex data [15]. Through data smoothing and automatic feature extraction, it achieved high accuracy in lifespan prediction under dynamic conditions. In recent years, researchers have investigated various deep learning frameworks for RUL prediction, including Transformer-based architectures [16], temporal convolutional networks (TCNs) [17], and reinforcement learning frameworks. These models can capture long-range dependencies or exploit exploration strategies, but they typically require large-scale training data and substantial computational resources, limiting their use in resource-constrained fuel cell applications.
Hybrid methods integrate multiple approaches to overcome the limitations of individual methods and improve performance [18]. Liu et al. [19] proposed a two-phase hybrid strategy for predicting the remaining useful life (RUL). In the first phase, an adaptive neuro-fuzzy inference system (ANFIS) is optimized using particle swarm optimization (PSO) to model long-term degradation patterns. The second phase employs a semi-empirical model in conjunction with an adaptive unscented Kalman filter (AUKF) to estimate the RUL. Experimental results confirm the method’s effectiveness in delivering reliable long-term degradation predictions and RUL assessments, with automated tuning of model parameters. To enhance prediction accuracy, another hybrid model integrates a least squares support vector machine (LSSVM) with a regularized particle filter (RPF) [20]. This method combines the data-driven capabilities of the LSSVM with the filtering precision of the RPF, offering not only accurate RUL predictions but also probabilistic uncertainty quantification.
Existing PEMFC degradation prediction methods still have several shortcomings. First, manual feature selection usually relies on experience, which makes it difficult to handle nonlinear relationships and high-order interaction terms and often leads to model redundancy. Second, conventional CNN and RNN architectures suffer from limited receptive fields and struggle to capture long-term dependencies. They also have difficulty identifying critical time steps and are susceptible to vanishing or exploding gradients [21]. Finally, empirical hyperparameter tuning is subjective, inefficient, and poorly reproducible [22]. To address these challenges, this study introduces an RUL prediction framework that integrates random forest feature selection with a two-layer BiLSTM [23] model enhanced by an attention mechanism. The methodology consists of the following stages: (1) A random forest algorithm automatically identifies relevant features from the input dataset. (2) Building on the conventional LSTM framework, a two-layer BiLSTM integrated with an attention mechanism strengthens the modeling of fuel cell voltage time-series data. (3) Optuna automates hyperparameter tuning and dynamically allocates experimental resources; combined with early stopping, this improves model effectiveness while maintaining the consistency and reliability of experimental outcomes [24]. (4) Fuel cell stack aging tests are designed and conducted to evaluate model performance.
The primary innovations of this research include the following:
(1) To improve the extraction of crucial features and emphasize key patterns in time-series degradation data, a two-layer bidirectional LSTM is integrated with an attention mechanism. This combined architecture enhances the model’s focus and interpretability during forecasting.
(2) The Optuna framework automatically optimizes the hyperparameters, searching the model structure over a predefined search space and enhancing the robustness of the prediction model.
(3) Degradation tests are conducted on PEMFC stacks, and the resulting experimental data are used to assess and confirm the effectiveness of the developed forecasting approach. This process helps enhance the relevance of the training dataset and the reliability of the evaluation strategy.
The organization of this article is as follows. Section 2 presents the design principles underlying the prediction scheme. Section 3 describes the comprehensive forecasting method. Section 4 presents the experimental findings together with the relevant validation and analysis. Lastly, Section 5 concludes the study and discusses possible directions for future investigations.
This research introduces a forecasting approach that combines random forest (RF), a two-layer bidirectional LSTM enhanced with an attention mechanism (BiLSTM-AT), and hyperparameter optimization using Optuna version 4.4.0. This integrated framework is hereafter termed the “proposed model.”
Figure 1 shows the framework of the proposed model. To emphasize its originality, the model’s performance is benchmarked against several recent methods, as illustrated in
Table 1.
3. Prediction Process
This section introduces the developed model designed to estimate PEMFC degradation and forecast its remaining useful life.
3.1. Data Description
To study the health state trends of fuel cells and develop estimation methods, long-term and extensive aging test data are required. Moreover, the mechanism of health state changes in fuel cells is a complex problem involving multiple influencing factors and couplings. In this work, the data utilized originate from the open-access IEEE PHM 2014 dataset, which was published by the French FCLAB research center specializing in fuel cells [
42]. The experimental subject is a PEMFC; the fuel cell system comprises a power unit, a fuel cell stack, a gas supply subsystem, various sensors, data acquisition components, and a control unit. Among these, the data acquisition system can record the physical parameters of the stack operation in real time, while the control system is provided by National Instruments. Prior to entering the stack, both air and hydrogen are routed through their respective humidification units. Thermal management is achieved by controlling the cooling water temperature, and current variations are managed by controlling the electronic load. The polarization curves of the fuel cell at different time points are shown in
Figure 7.
This study employs a dataset consisting of two PEM fuel cell stacks: FC1, tested under constant operating conditions, and FC2, evaluated in a dynamically varying environment. FC1 runs under a stable loading scenario with a constant 70 A current, while FC2 experiences a semi-dynamic profile, where a 70 A base current is modulated by a 5 kHz triangular waveform at 10% amplitude.
Critical experimental factors include aging time, voltage at each cell, current levels, inlet/outlet temperature and pressure readings, gas (hydrogen/air) flow rates, cooling water characteristics, and air humidity at the inlet. The parameters of the test stack and the specific operating parameters for providing a stable environment for stack operation are listed in
Table 2.
Figure 8 shows the polarization curve of FC2. Compared to the polarization curve at the initial time point (0 h), the polarization curves measured at 185 h, 348 h, and 515 h are significantly lower than the initial state. They exhibit a clear trend of declining curve height as operation time increases. This phenomenon indicates a progressive decline in the open-circuit voltage (OCV) of the fuel cell as the operating time extends. Under identical current density conditions, the output voltage exhibits a further decrease, accompanied by a noticeable variation in ohmic resistance. Once the performance degradation of the stack reaches a certain threshold, the polarization curve may fail to span the entire range of current densities.
By observing the electrochemical impedance spectroscopy (EIS), it can be noted that as the stack performance degrades, both the imaginary and real parts of the impedance show an increasing trend, and the coordinate values corresponding to the two intersections with the real axis gradually increase. This suggests that internal processes such as mass transport, electrochemical reactions, and ionic conduction within the stack are hindered, further contributing to the performance decline.
3.2. Data Preprocessing
The dataset used includes two groups of data, FC1 and FC2. FC1 contains 25 data labels, with each label consisting of 143,862 data points, while FC2 also contains 25 data labels, with each label consisting of 127,370 data points.
Figure 9 illustrates some of the aging parameters for FC1 and FC2.
Through preliminary screening and analysis of the data, it was found that both datasets contain a significant number of outliers and noise. Therefore, further processing is required, which is carried out in two steps.
3.2.1. Outlier Removal
Outliers are identified and removed using the interquartile range (IQR) method. The IQR is defined as:
$$\mathrm{IQR} = Q_3 - Q_1$$
where $Q_1$ is the first quartile (25% of the data), and $Q_3$ is the third quartile (75% of the data).
Outlier Detection Boundaries:
$$\mathrm{Lower} = Q_1 - k \cdot \mathrm{IQR}, \qquad \mathrm{Upper} = Q_3 + k \cdot \mathrm{IQR}$$
where $k$ is the threshold for outlier detection, held fixed in this study.
Data Removal: Samples that fall outside the range $[\mathrm{Lower}, \mathrm{Upper}]$ are considered outliers and are removed from the dataset.
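The IQR filtering step can be illustrated with a minimal sketch. The function names, the sample voltages, and the multiplier `k = 1.5` are assumptions made for illustration (1.5 is a common convention, not a value taken from this study), and the quartiles use the "inclusive" interpolation of Python's standard library, which may differ slightly from the implementation used by the authors.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR.

    k = 1.5 is an assumed, conventional multiplier.
    """
    # quantiles(n=4) returns the three quartile cut points [Q1, Q2, Q3].
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Keep only the samples that lie inside the IQR bounds."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]

# Hypothetical stack-voltage samples (V) containing one obvious spike.
voltages = [3.31, 3.30, 3.32, 3.29, 3.31, 5.00, 3.30, 3.28]
cleaned = remove_outliers(voltages)
print(cleaned)
```

A larger `k` keeps more of the tails, so the multiplier trades off noise rejection against the risk of discarding genuine degradation events.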
3.2.2. Data Smoothing
A widely adopted technique for reducing noise in time series data is the moving average, which computes the mean of the observations within a sliding window. The mathematical formula is as follows:
$$\hat{x}_t = \frac{1}{n} \sum_{i=t-m}^{t+m} x_i, \qquad m = \frac{n-1}{2}$$
In this formula, $\hat{x}_t$ indicates the value after smoothing at time point $t$, where $n$ refers to the overall size of the smoothing window. The term $x_i$ refers to the raw data at time $i$, and $m$ corresponds to half the width of the smoothing interval. For this research, a window length of $n = 5$ was selected and kept constant.
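As a rough illustration of the smoothing step, the sketch below applies a centered moving average with a window of five samples. The function name and the edge handling (truncating the window where fewer neighbours exist) are assumptions; the text does not state how the first and last samples were treated.

```python
def moving_average(values, n=5):
    """Centered moving average with an odd window length n."""
    if n < 1 or n % 2 == 0:
        raise ValueError("window length n must be a positive odd integer")
    m = (n - 1) // 2  # half-width of the window
    smoothed = []
    for t in range(len(values)):
        # Assumption: near the edges the window is truncated to the
        # samples that actually exist.
        lo = max(0, t - m)
        hi = min(len(values), t + m + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Hypothetical raw samples; interior points are averaged over 5 neighbours.
raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
smoothed = moving_average(raw)
print(smoothed)
```

The output has the same length as the input, which keeps the smoothed voltage aligned with the original time stamps.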
Figure 9 illustrates a distinct decline in stack voltage as time progresses, which aligns with typical fuel cell degradation patterns over their lifespan. Consequently, the emphasis of this work lies in preprocessing the stack voltage signals, with the filtered voltage trajectories of FC1 and FC2 shown in
Figure 10.
3.3. Feature Selection
This research employed the Random Forest algorithm to assess feature relevance and measure the impact of each input on the target variable Utot(V). In practice, Random Forest models were constructed separately for static and dynamic data. The model’s built-in feature evaluation mechanism was employed to assess the relative significance of each input variable. We excluded single-cell voltage U1~U5 and current density J due to the difficulty of obtaining single-cell voltage data in practical applications and the fact that current density J varies in the same way as current I [
43]. For the remaining features, Random Forest was applied independently to FC1 and FC2 to obtain their respective feature rankings, which are visualized in
Figure 11. The results show that there is a difference in feature importance ranking between static and dynamic data.
In FC1, the importance of air outlet flow (DoutAIR) is as high as 71.4%, far above the other features, indicating that the gas outlet side has a decisive influence on performance degradation. In FC2, by contrast, the air inlet temperature (TinAIR) becomes the dominant factor with a share of 59.1%, indicating that the performance change is governed more by inlet thermal perturbations. In addition, FC1 relies more on gas supply and ventilation efficiency indicators (e.g., TinH2, I), while FC2 depends more on temperature control and water management indicators (e.g., DWAT). These differences confirm the need to model the two stacks separately.
3.4. Model Training and Optimization
After selecting the most important features using the Random Forest algorithm, the input features were arranged as multi-dimensional time sequences. A sliding window approach was adopted with step sizes of 5, 10, 15, and 25 to segment the time-series into samples.
The dataset was divided into three parts: 50% for training, 10% for validation, and 40% for testing. A two-layer BiLSTM model enhanced with an attention mechanism was implemented using the Keras framework [23]. Hyperparameter tuning, covering the number of LSTM neurons, dropout probability, and learning rate, was conducted via Optuna, with the mean squared error (MSE) on the validation set serving as the objective function. The model was ultimately trained for 50 epochs using the Adam optimizer, with an early stopping strategy applied to avoid overfitting.
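The data preparation described above can be sketched as follows. This is a simplified, single-feature illustration: the function names are hypothetical, the "step size" is interpreted here as the number of past points in each input window (an assumption), and the split is taken chronologically without shuffling, which is the usual choice for degradation data but is not stated explicitly in the text.

```python
def chronological_split(samples, train_pct=50, val_pct=10):
    """Split a time-ordered list into train/validation/test parts (no shuffling)."""
    n = len(samples)
    i_train = n * train_pct // 100
    i_val = n * (train_pct + val_pct) // 100
    return samples[:i_train], samples[i_train:i_val], samples[i_val:]

def make_windows(series, window):
    """Pair each run of `window` consecutive points with the next point as target."""
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window])
    return inputs, targets

# Hypothetical voltage-like series of 20 points.
series = [float(i) for i in range(20)]
train, val, test = chronological_split(series)  # 50% / 10% / 40%
x_train, y_train = make_windows(train, window=5)
print(len(train), len(val), len(test), len(x_train))
```

Splitting before windowing keeps every training window strictly earlier in time than the validation and test samples, which avoids leaking future information into training.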
4. Results and Discussion
In both the FC1 and FC2 datasets, 50% of the aging data was allocated for model training, 10% for validation, and the remaining 40% for testing. The remaining useful life (RUL) was defined as the duration from the 550 h point until the PEMFC voltage had dropped by a specified percentage of its initial value. The failure thresholds were set to 3.0%, 3.5%, and 4.0% for FC1, and 3.5%, 4.0%, and 4.5% for FC2.
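The model’s prediction performance was assessed using standard regression metrics: root mean square error (RMSE), mean absolute percentage error (MAPE), and the coefficient of determination ($R^2$), as described in [44]. The corresponding formulas are provided below:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
where $y_i$ is the i-th true value (observed value), $\hat{y}_i$ represents the i-th predicted value, $\bar{y}$ is the mean of the true values, and $n$ is the number of samples.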
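For reference, the three metrics can be computed with a few lines of plain Python. This sketch uses the standard textbook definitions of RMSE, MAPE (in percent), and the coefficient of determination; the function names and sample values are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes no true value is zero."""
    return 100.0 / len(y_true) * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical true and predicted values.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
print(rmse(y_true, y_pred), mape(y_true, y_pred), r2(y_true, y_pred))
```

Because stack voltages stay well away from zero, the division in MAPE is safe for this application; for signals that cross zero a different relative metric would be needed.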
4.1. Hyperparametric Optimization Results
To enhance the model’s generalization ability and prediction accuracy, the Optuna framework is employed to optimize the hyperparameters automatically. The number of LSTM units, which determines model complexity, was chosen from 32, 64, 128, and 256. The dropout rate, which mitigates overfitting, was sampled from the continuous range 0.0 to 0.5. The learning rate (lr), which controls the update step of the optimizer, was sampled from a log-uniform distribution ranging from $1 \times 10^{-5}$ to $1 \times 10^{-3}$. Each hyperparameter configuration was trained for 30 epochs, and the objective was to minimize the MSE on the validation set. Step sizes of 5, 10, 15, and 25 were tested with a limit of 30 trials, and the best-performing parameter set was selected, as summarized in Table 3.
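To make the search space concrete, the sketch below samples it with plain random search from the Python standard library. This is a deliberate stand-in, not Optuna: Optuna's default sampler is more sophisticated, and the objective here (`toy_validation_mse`) is a hypothetical placeholder for training the network and returning its validation MSE. Only the ranges (units in {32, 64, 128, 256}, dropout in [0, 0.5], log-uniform learning rate in [1e-5, 1e-3], 30 trials) come from the text.

```python
import math
import random

def sample_config(rng):
    """Draw one configuration from the search space described in the text."""
    return {
        "units": rng.choice([32, 64, 128, 256]),
        "dropout": rng.uniform(0.0, 0.5),
        # Log-uniform: sample the base-10 exponent uniformly in [-5, -3].
        "lr": 10 ** rng.uniform(-5.0, -3.0),
    }

def toy_validation_mse(config):
    """Hypothetical stand-in for 'train the model, return validation MSE'."""
    return (math.log10(config["lr"]) + 4.0) ** 2 + (config["dropout"] - 0.2) ** 2

def random_search(objective, n_trials=30, seed=0):
    """Evaluate n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

best, score = random_search(toy_validation_mse)
print(best, score)
```

Sampling the learning rate on a logarithmic scale gives each order of magnitude equal weight, which is why a log-uniform distribution is preferred over a linear one for this parameter.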
The hyperparameter optimization results at different step sizes show that FC1 and FC2 differ somewhat in model configuration. Both use 128 LSTM units for short-term prediction (5 and 10 steps) and 256 units for long-term prediction (15 and 25 steps), which indicates that longer time-series dependencies require a larger model. The dropout rate of FC1 increases significantly with the step size and reaches 0.3104 for long-term prediction, suggesting that it is more prone to overfitting. The dropout rate of FC2 is lower overall and varies less, indicating that its feature distribution is more stable. In terms of learning rate, FC1 is significantly lower at 15 and 25 steps, which tends to improve training stability, while FC2 maintains a higher learning rate at all step sizes, giving faster training with stable convergence. Taken together, FC1 is more sensitive to model complexity and regularization, while FC2 shows better robustness and generalization ability. Once the optimal hyperparameters are determined, the model is retrained for 50 epochs on the combined training and validation datasets, and its final performance is then assessed on the test dataset.
4.2. Long-Term Degradation Trend Prediction Results
To estimate the remaining lifetime of FC1 and FC2, a tailored deep learning framework was adopted: a two-layer BiLSTM network enhanced with an attention mechanism. Key hyperparameters, including the number of LSTM units, dropout rate, and learning rate, were optimized with the Optuna framework. Training used the Adam optimizer, with the learning rate searched within the interval $1 \times 10^{-5}$ to $1 \times 10^{-3}$. Training was further refined using early stopping and an adaptive learning rate, and was limited to 50 epochs with a batch size of 32 and a time window of 20. Voltage forecasting was conducted independently for the static and dynamic datasets to evaluate the remaining useful life of each stack.
The initial voltage of FC1 is 3.313 V. The voltages corresponding to the failure thresholds (FT) are 3.21361 V (FT: 3.0%), 3.197045 V (FT: 3.5%), and 3.18048 V (FT: 4.0%). The corresponding remaining useful lives, measured from 550 h, are 258.99 h (RUL1), 266.09 h (RUL2), and 338.6 h (RUL3), respectively. The prediction results of FC1 are shown in Figure 12.
The initial voltage of FC2 is 3.3154 V. The voltages corresponding to the failure thresholds are 3.199361 V (FT: 3.5%), 3.182784 V (FT: 4.0%), and 3.166207 V (FT: 4.5%). The corresponding remaining useful lives, measured from 550 h, are 68.74 h (RUL1), 207.55 h (RUL2), and 221.7 h (RUL3), respectively. Figure 13 illustrates the prediction results obtained for FC2. Both FC1 and FC2 exhibit consistent voltage degradation trends, despite differences in loading conditions (steady vs. quasi-dynamic), supporting the generalization capability of the model.
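The threshold voltages quoted above follow directly from the initial voltage and the allowed percentage drop, and the RUL is the time from 550 h to the first threshold crossing. The sketch below reproduces that arithmetic; the function names and the short voltage trajectory are hypothetical, while the initial voltages and threshold percentages are those reported in the text.

```python
def threshold_voltage(v_initial, drop_percent):
    """Voltage at which the stack has lost `drop_percent` of its initial voltage."""
    return v_initial * (1.0 - drop_percent / 100.0)

def rul_from_trajectory(times_h, voltages, v_threshold, t_start=550.0):
    """Hours from t_start until the voltage first reaches the threshold, or None."""
    for t, v in zip(times_h, voltages):
        if t >= t_start and v <= v_threshold:
            return t - t_start
    return None

# Initial voltages and failure thresholds (FT, in percent) as reported in the text.
fc1_thresholds = [threshold_voltage(3.313, ft) for ft in (3.0, 3.5, 4.0)]
fc2_thresholds = [threshold_voltage(3.3154, ft) for ft in (3.5, 4.0, 4.5)]

# Hypothetical, coarsely sampled voltage trajectory (not measured data).
times_h = [550.0, 600.0, 650.0, 700.0]
voltages = [3.25, 3.22, 3.21, 3.19]
example_rul = rul_from_trajectory(times_h, voltages, fc1_thresholds[0])
print(fc1_thresholds, fc2_thresholds, example_rul)
```

On real data the crossing time depends on how the voltage is smoothed, since a noisy signal can dip below the threshold briefly before the sustained decline reaches it.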
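To evaluate how accurately the model predicts, the final RUL estimates were scored following the method described in [42]:
(1) The true RUL and the predicted RUL are expressed as:
$$\mathrm{RUL}_{\mathrm{true}} = t_{\mathrm{true}} - t_0, \qquad \mathrm{RUL}_{\mathrm{pred}} = t_{\mathrm{pred}} - t_0$$
where $t_{\mathrm{true}}$ denotes the true time when the failure threshold is first reached, $t_0$ denotes the initial set prediction time, here $t_0 = 550$ h, and $t_{\mathrm{pred}}$ denotes the predicted time when the failure threshold is first reached.
(2) The percentage error, Er, quantifies the deviation between actual and predicted RUL, and is computed as:
$$Er = \frac{\mathrm{RUL}_{\mathrm{true}} - \mathrm{RUL}_{\mathrm{pred}}}{\mathrm{RUL}_{\mathrm{true}}} \times 100\%$$
(3) RUL Predictive Accuracy ($A_{\mathrm{FT}}$) Score:
$$A_{\mathrm{FT}} = \begin{cases} \exp\left(-\ln(0.5) \cdot \dfrac{Er}{5}\right), & Er \le 0 \\ \exp\left(+\ln(0.5) \cdot \dfrac{Er}{20}\right), & Er > 0 \end{cases}$$
According to Equation (26), a non-positive error (Er ≤ 0) implies an overestimated RUL, incurring a heavy penalty, whereas a positive error (Er > 0) implies underestimation and results in a milder penalty.
(4) The final score over the different failure thresholds is calculated as follows.
$$\mathrm{Score} = \frac{1}{N} \sum_{j=1}^{N} A_{\mathrm{FT}_j}$$
Here N denotes the number of selected failure thresholds; in this study, N = 3 for both FC1 and FC2.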
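The scoring procedure can be sketched in a few lines. This example assumes the asymmetric exponential scoring used in the IEEE PHM 2014 Data Challenge, with penalty constants 5 (overestimation) and 20 (underestimation); those constants and the function names are assumptions here. The true RUL values are the FC1 values reported in the text, and the predicted values are hypothetical.

```python
import math

def percent_error(rul_true, rul_pred):
    """Er in percent; negative when the RUL is overestimated."""
    return 100.0 * (rul_true - rul_pred) / rul_true

def accuracy(er):
    """Asymmetric score in (0, 1]; overestimation (er <= 0) decays faster."""
    if er <= 0:
        return math.exp(-math.log(0.5) * (er / 5.0))
    return math.exp(math.log(0.5) * (er / 20.0))

def final_score(pairs):
    """Average the per-threshold accuracy over (true, predicted) RUL pairs."""
    pairs = list(pairs)
    return sum(accuracy(percent_error(t, p)) for t, p in pairs) / len(pairs)

fc1_true = [258.99, 266.09, 338.6]            # reported in the text
fc1_pred_hypothetical = [255.0, 266.09, 345.0]  # illustrative only
score = final_score(zip(fc1_true, fc1_pred_hypothetical))
print(score)
```

With these constants, a 5% overestimate halves the score while it takes a 20% underestimate to do the same, reflecting that predicting a failure too late is more costly than predicting it too early.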
As presented in
Table 4, the proposed model achieves RUL prediction scores of 0.99 on both FC1 and FC2, demonstrating its strong performance in estimating the remaining service life of PEMFCs.
4.3. Model Comparison and Validation
To further assess the predictive performance, this section compares the proposed method with four baseline models (GRU [45], LSTM, SVR [46], and LSTM with attention) under varying training step settings. Figure 14 and Figure 15 illustrate the prediction results for FC1 and FC2, while Table 5 and Table 6 detail model configurations and evaluation metrics. Figure 16 presents the RMSE and MAPE across different step sizes.
Comparative analysis reveals that the proposed model consistently achieves superior accuracy and robustness. While the recurrent baselines are affected by issues such as vanishing or exploding gradients over extended prediction ranges, the proposed approach maintains stable performance. This advantage stems from two key components: (1) the bidirectional LSTM processes each input window in both the forward and backward directions, capturing temporal patterns in degradation sequences from both sides and enhancing sequence comprehension; and (2) the attention mechanism adaptively emphasizes degradation-critical time windows, improving interpretability and predictive accuracy. These features enable the model to outperform traditional architectures such as SVR, which has no memory, and GRU, which lacks bidirectional context and attention weighting.
4.4. Case Study
This section illustrates a practical application of the proposed two-layer BiLSTM-AT combined with Optuna optimization, employing experimental PEMFC degradation data to validate prediction outcomes.
4.4.1. Description of PEMFC Degradation Data
The experimental data were collected from a laboratory fuel cell platform equipped with a real-time data acquisition system for monitoring polarization behavior. As depicted in Figure 17, the setup includes a test bench with a fuel cell stack composed of 36 individual cells and rated at up to 1.6 kW. Temperature probes are installed inside the cooling channels so that they respond directly to the surface temperature in the cooled region; three thermocouples in total are inserted at both ends and at the center of the stack. The surface of the stack carries a cooling fan array consisting of upstream and downstream fans with a maximum speed of 6000 rpm and an input power of 4.6 W. At maximum power, the measured wind speed was 2.2 m/s. The stack was evaluated under stable operating conditions at a fixed current of 15 A, and the voltage was recorded at a constant interval of 0.8 s.
Table 7 lists the parameters of the test stack and the environmental parameters.
4.4.2. Predicted Results
The collected data were used as model inputs for the prediction experiments after the necessary preprocessing operations (e.g., outlier removal, smoothing, and normalization). The model input features include the rate of change in voltage, the normalized time, the voltage offset with respect to the initial value, the sliding mean, and the sliding standard deviation. All of these features are key indicators of the degradation behavior of the fuel cell.
Figure 18 presents the initial voltage measurements and corresponding temperature data from various internal locations of the stack.
The processed test dataset was fed into the developed prediction framework, and the corresponding outputs are illustrated in Figure 19. The experimental results show that the model’s prediction curves on the actual stack test data are highly consistent with the measured voltage curves and accurately capture the trends of the voltage drop and sudden change phases. In this quasi-static validation, the model achieves a high $R^2$ with low RMSE and MAPE values, indicating good generalization ability and robustness. Although no explicit failure threshold is defined for this dataset, the predicted voltage trends closely match the measured degradation curve, including the sudden drop region. This supports the model’s ability to capture real-world degradation behavior. In future work, we plan to define voltage thresholds in physical experiments to enable direct RUL quantification.