A Comparison of Different Transformer Models for Time Series Prediction
Abstract
1. Introduction
2. Literature Review
3. Dataset
- Cycle Index tracks the number of completed cycles, enabling analysis of performance degradation over time.
- Voltage and Current Profiles include maximum and minimum voltages during charge and discharge cycles, alongside charging durations and discharge times (Figure 1).
- Charging and Discharging Metrics detail the time spent at specific voltage levels, providing information for understanding the battery efficiency and operational behavior.
- The training set consists of 10,545 samples.
- The testing set consists of 4,519 samples.
4. Applied Approaches
4.1. Data Preprocessing
- Feature-Target Separation: The features, including sensor data such as voltage and current, were isolated from the target RUL variable to establish clear input–output relationships, as shown in Table 1, where RUL, the target variable, occupies the final column.
- Scaling:
  - MinMax Scaling was applied to the Transformer and SimSiam models, normalizing feature values to the range [0, 1] to ensure a uniform contribution during training.
  - Standard Scaling was used for the CNN–Transformer model to standardize the data to zero mean and unit variance, enhancing the effectiveness of the convolutional layers.
- Sequence Formation: Overlapping windows of 15 time steps were generated to capture temporal dependencies, providing context for accurate RUL predictions.
- Data Augmentation: Techniques such as Gaussian noise injection and masking were employed to simulate real-world uncertainties and improve the generalization capabilities of the models. A minimal sketch of the sequence formation and both augmentation schemes is given after this list.
  - Mask Augmentation: This is a data augmentation technique used in machine learning and deep learning, especially for TS data. A binary mask $M$ is generated by sampling from a Bernoulli distribution with a fixed masking probability of 0.1 and applied to the input feature matrix $X \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ the number of features, so that a total of 10% of the values are randomly selected and replaced with zeros. The masking operation is defined in Equation (1):

    $$X_{\text{masked}} = X \odot (1 - M), \tag{1}$$

    where $M$ is a binary mask matrix with the same shape as $X$ and $\odot$ denotes element-wise multiplication; a total of 10% of the entries in $M$ are 1, marking the masked positions. By simulating partial sensor dropout or data loss, the augmented dataset is created by concatenating the original and masked inputs, along with their labels, effectively doubling the training data size and improving model robustness to missing or corrupted features.
  - Noise Augmentation: This technique adds Gaussian noise with a fixed standard deviation (e.g., 0.01) independently to each entry of the input sequence $X \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ the number of features. The noisy input is defined in Equation (2):

    $$X_{\text{noisy}} = X + \epsilon, \tag{2}$$

    where $\epsilon \sim \mathcal{N}(0, \sigma^{2})$ is a noise matrix of the same shape as $X$. Combining the original and augmented data increases the dataset size, improving model robustness and generalization by simulating sensor noise and variability in real-world inputs.
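The steps above can be summarized in a short sketch, written here with NumPy under illustrative function names; the window length (15), masking probability (0.1), and noise standard deviation (0.01) follow the values stated in this section, and the input arrays are assumed to be already scaled.

```python
import numpy as np

def make_windows(features, targets, window=15):
    """Form overlapping windows of `window` time steps (stride 1);
    each window is paired with the RUL at its last time step."""
    X, y = [], []
    for i in range(len(features) - window + 1):
        X.append(features[i:i + window])
        y.append(targets[i + window - 1])
    return np.array(X), np.array(y)

def mask_augment(X, p=0.1, seed=0):
    """Equation (1): zero out roughly 10% of the entries using a Bernoulli(p) mask."""
    rng = np.random.default_rng(seed)
    M = rng.binomial(1, p, size=X.shape)   # 1 marks a masked entry
    return X * (1 - M)

def noise_augment(X, sigma=0.01, seed=0):
    """Equation (2): add zero-mean Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    return X + rng.normal(0.0, sigma, size=X.shape)

# Doubling the training set with masked (or noisy) copies, as described above:
# X_train, y_train = make_windows(scaled_features, rul_targets)
# X_aug = np.concatenate([X_train, mask_augment(X_train)], axis=0)
# y_aug = np.concatenate([y_train, y_train], axis=0)
```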
4.2. Encoder-Only Transformer Model
- Multi-Head Attention: It facilitates simultaneous focus on various temporal segments, identifying key relationships in the data.
- Feedforward Network: It combines non-linear and linear transformations to uncover complex patterns within the input features.
- Layer Normalization and Dropout: These ensure numerical stability during training and prevent overfitting by regularizing the network. A minimal sketch of such an encoder block for RUL regression is given after this list.
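As a concrete illustration, the sketch below assembles one such encoder block in Keras and attaches a single-output regression head, assuming 15-step input windows. The number of input features (eight) and the attention key dimension are placeholders, while the head count, feedforward width, dropout rate, and learning rate roughly follow the best Encoder-Only Transformer values listed in Table 5.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def transformer_encoder_block(x, num_heads=2, key_dim=16, dff=512, dropout=0.01):
    # Multi-head self-attention over the time steps of the window
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(dropout)(attn))
    # Position-wise feedforward network: non-linear expansion, then linear projection
    ff = layers.Dense(dff, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    return layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(dropout)(ff))

def build_encoder_only_model(seq_len=15, num_features=8):
    inputs = layers.Input(shape=(seq_len, num_features))
    x = transformer_encoder_block(inputs)
    x = layers.GlobalAveragePooling1D()(x)   # collapse the time dimension
    outputs = layers.Dense(1)(x)             # single RUL regression output
    return models.Model(inputs, outputs)

model = build_encoder_only_model()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse", metrics=["mae"])
```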
4.3. Hybrid CNN–Transformer Model
- CNN Component:
  - Convolutional Filters: Convolutional filters in CNNs extract localized patterns, such as edges, textures, or local features, from the input data. These filters help identify significant trends or anomalies relevant to the task.
  - Global Pooling: It reduces dimensionality while preserving essential features, ensuring compatibility with the Transformer layers.
- Transformer Component: It processes the extracted spatial features to capture long-term temporal dependencies using self-attention and feedforward layers. A minimal sketch combining the two components is given after this list.
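A minimal Keras sketch of one plausible arrangement of this hybrid is given below: Conv1D filters extract localized patterns, stacked Transformer encoder layers capture long-range dependencies, and global average pooling feeds a regression head. The filter count, kernel size, key dimension, and the exact placement of the pooling layer are assumptions; the remaining hyperparameters follow the Transformer–CNN (Without Aug.) row of Table 5.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_transformer(seq_len=15, num_features=8, num_layers=2,
                          num_heads=2, dff=256, dropout=0.01):
    inputs = layers.Input(shape=(seq_len, num_features))
    # CNN component: 1D convolutions extract localized temporal patterns
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(inputs)
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)
    # Transformer component: self-attention over the convolutional feature sequence
    for _ in range(num_layers):
        attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=16)(x, x)
        x = layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(dropout)(attn))
        ff = layers.Dense(dff, activation="relu")(x)
        ff = layers.Dense(x.shape[-1])(ff)
        x = layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(dropout)(ff))
    # Global pooling reduces dimensionality before the regression head
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(1)(x)
    return models.Model(inputs, outputs)

model = build_cnn_transformer()
model.compile(optimizer=tf.keras.optimizers.Adam(4.5e-4), loss="mse", metrics=["mae"])
```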
4.4. SimSiam-Based Transfer Learning Model
- Pre-Training: The model learns to match feature representations from different augmented views of the same sequence, encouraging invariance to minor perturbations.
- Cosine Similarity Loss: This loss function guides the pretraining process by maximizing the similarity between paired feature representations while avoiding trivial solutions through gradient control mechanisms.
- Untrainable Backbone: In this configuration, the backbone model remains frozen (untrainable) during the second phase of training. This allows the pre-trained features to remain intact, with the additional layers adapting these features to the RUL prediction task. This approach reduces the risk of overfitting, but may limit the model’s ability to adapt to task-specific details in the labelled data.
- Trainable Backbone: In this configuration, the layers of the backbone model are made trainable during the second (supervised) training phase. This allows the pre-trained features to be fine-tuned for the RUL prediction task, enhancing the model’s adaptability and alignment with the specific dataset, although it also introduces a greater risk of overfitting if not managed with care. A minimal sketch of the two-phase SimSiam setup is given after this list.
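The two-phase procedure can be sketched as follows, assuming the backbone is a Keras model that maps an input window to a flat feature vector (for instance, the encoder of Section 4.2 without its output layer) and that the projector and predictor are small dense networks; all names, layer sizes, and the loop structure are illustrative rather than the exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def negative_cosine_similarity(p, z):
    """SimSiam loss: the stop-gradient on z is what prevents collapsed solutions."""
    z = tf.stop_gradient(z)
    p = tf.math.l2_normalize(p, axis=1)
    z = tf.math.l2_normalize(z, axis=1)
    return -tf.reduce_mean(tf.reduce_sum(p * z, axis=1))

def pretrain_step(backbone, projector, predictor, view1, view2, optimizer):
    """Phase 1: match representations of two augmented views of the same sequence."""
    with tf.GradientTape() as tape:
        z1, z2 = projector(backbone(view1)), projector(backbone(view2))
        p1, p2 = predictor(z1), predictor(z2)
        loss = 0.5 * (negative_cosine_similarity(p1, z2)
                      + negative_cosine_similarity(p2, z1))
    variables = (backbone.trainable_variables + projector.trainable_variables
                 + predictor.trainable_variables)
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss

def build_rul_head(backbone, seq_len=15, num_features=8, trainable_backbone=False):
    """Phase 2: supervised RUL regression with a frozen or adaptable backbone."""
    backbone.trainable = trainable_backbone
    inputs = layers.Input(shape=(seq_len, num_features))
    x = layers.Dense(64, activation="relu")(backbone(inputs))
    return models.Model(inputs, layers.Dense(1)(x))

# Illustrative projector/predictor heads:
# projector = models.Sequential([layers.Dense(128, activation="relu"), layers.Dense(64)])
# predictor = models.Sequential([layers.Dense(32, activation="relu"), layers.Dense(64)])
```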
4.5. Hyperparameter Tuning
- Attention Heads: Explored between 2 and 8 to capture diverse feature subspaces without excessive computational costs.
- Feedforward Dimensions: Tuned between 128 and 512 units to balance the model’s learning capacity with the risk of overfitting.
- Dropout Rates: Adjusted between 0.01 and 0.3 to enhance generalization while retaining critical features.
- Learning Rates: Evaluated on a logarithmic scale to ensure efficient convergence during training.
- Transformer Layers (Hybrid Model): Limited to 1 to 4 layers to capture long-range dependencies while maintaining computational efficiency. A simple random-search sketch over these ranges is given after this list.
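A plain random search over the ranges above could look like the sketch below; `build_and_evaluate` is a hypothetical callback that trains a model with a given configuration and returns its validation MSE, and the learning-rate bounds (10⁻⁴ to 10⁻¹) are illustrative choices that cover the best learning rates later reported in Table 5.

```python
import math
import random

def sample_config(rng):
    """Draw one configuration from the search space described in this section."""
    return {
        "num_heads": rng.choice([2, 4, 8]),
        "dff": rng.choice([128, 256, 384, 512]),
        "dropout": rng.uniform(0.01, 0.3),
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform sampling
        "num_transformer_layers": rng.randint(1, 4),  # hybrid model only
    }

def random_search(build_and_evaluate, n_trials=20, seed=42):
    """Keep the configuration with the lowest validation loss."""
    rng = random.Random(seed)
    best_config, best_loss = None, math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        loss = build_and_evaluate(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss
```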
4.6. Theoretical and Mathematical Framework
4.6.1. Theoretical Background of the Encoder-Only Transformer Model
4.6.2. Theoretical Background of the Transformer–CNN Model
5. Experimental Results
5.1. Evaluation Metrics
- Mean Absolute Error (MAE): Measures the average magnitude of errors in predictions; a lower MAE indicates better accuracy. It is computed as $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$, where $n$ is the number of observations, $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.
- Mean Squared Error (MSE): Focuses on the average squared differences between predicted and actual values, $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, and is sensitive to larger errors. The same metric was used as the validation loss to quantify the model’s ability to generalize during training.
- Coefficient of Determination ($R^2$): Represents the proportion of variance in the target variable explained by the model, with values closer to 1 indicating a better fit. A short sketch computing all three metrics is given after this list.
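All three metrics can be computed with a few lines of NumPy, as in the sketch below; the example call uses the first three rows of the sample-prediction table shown later in the results.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute MAE, MSE, and the coefficient of determination (R^2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    mse = np.mean((y_true - y_pred) ** 2)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return {"MAE": mae, "MSE": mse, "R2": 1.0 - ss_res / ss_tot}

# Example: first three sample predictions reported in the results
print(evaluate([63.0, 874.0, 891.0], [63.89, 879.69, 891.67]))
```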
5.2. Results of the Encoder-Only Transformer Model
5.3. Results of the Transformer–CNN Hybrid Model Without Augmentation
5.4. Results of the Transformer–CNN Hybrid Model with Mask Augmentation
5.5. Results of the Transformer–CNN Hybrid Model with Noise Augmentation
5.6. Results of the SimSiam Transfer Learning Model
- Untrainable backbone model: The backbone remained untrainable during the second phase of training. Training and validation losses were 903.54 and 716.93, respectively, with MAE values of 19.90 and 16.35. Significant deviations in the predicted versus actual RUL highlight the model’s limitations.
- Adaptable backbone model: In this configuration, the backbone layers of the SimSiam model were allowed to adjust during the second training phase. Training and validation losses reduced to 80.63 and 51.53, respectively, with MAE values of 6.62 and 4.96. Detailed numerical results are provided in the last row of Table 4.

Table 5 shows the best values for the hyperparameters of the different models. An additional hyperparameter, Number of Layers, was tuned exclusively for the Transformer–CNN models. It defines the number of Transformer encoder layers in the architecture and was not applicable to the Encoder-Only Transformer or SimSiam Transfer Learning models.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References





Table 1. A sample of the dataset; RUL is the target variable in the final column.

| Cycle Index | Discharge Time (s) | Max. Voltage Discharge (V) | … | RUL |
|---|---|---|---|---|
| 1 | 2595.30 | 3.67 | … | 1112 |
| 2 | 7408.64 | 4.246 | … | 1111 |
| 3 | 7393.76 | 4.249 | … | 1110 |
| 4 | 7385.50 | 4.250 | … | 1109 |
| Sample | Actual RUL | Predicted RUL |
|---|---|---|
| 1 | 63.00 | 63.89 |
| 2 | 874.00 | 879.69 |
| 3 | 891.00 | 891.67 |
| 4 | 301.00 | 296.32 |
| 5 | 345.00 | 342.22 |
| Model | Loss (MSE) | MAE | R² Score |
|---|---|---|---|
| Transformer–CNN (Without Augmentation) | 9.78 | 2.34 | 0.99985 |
| Transformer–CNN (Mask Augmentation) | 4.31 | 1.39 | 0.999958 |
| Transformer–CNN (Noise Augmentation) | 3.47 | 1.22 | 0.999966 |
Table 4. Comparison of the results of all models.

| Model | Loss (MSE) | MAE | R² Score |
|---|---|---|---|
| Encoder-Only Transformer | 41.07 | 3.99 | 0.99960 |
| Transformer–CNN (Without Aug.) | 9.78 | 2.34 | 0.99985 |
| Transformer–CNN (Mask Aug.) | 4.31 | 1.39 | 0.99996 |
| Transformer–CNN (Noise Aug.) | 3.47 | 1.22 | 0.99997 |
| SimSiam Transfer Learning (Untrainable) | 716.93 | 16.35 | 0.99020 |
| SimSiam Transfer Learning (Adaptable) | 51.53 | 4.96 | 0.99910 |
Table 5. Best hyperparameter values for the different models.

| Model | Number of Heads | Feedforward Network Dimensionality (dff) | Dropout Rate | Learning Rate |
|---|---|---|---|---|
| Encoder-Only Transformer | 2 | 512 | 0.01 | 0.00103 |
| Transformer–CNN (Without Aug.) | 2 | 256 | 0.01 | 0.00045 |
| Transformer–CNN (Mask Aug.) | 2 | 384 | 0.01 | 0.00075 |
| Transformer–CNN (Noise Aug.) | 2 | 256 | 0.01 | 0.00055 |
| SimSiam Transfer Learning (Untrainable) | 4 | 512 | 0.02 | 0.04500 |
| SimSiam Transfer Learning (Adaptable) | 4 | 512 | 0.02 | 0.01500 |

